selection should consider weight variance as a parameter. If the number of data points is reasonably small, NNK-ELM can thus result in considerable time savings when doing model selection. A further advantage is that, unlike ELM, NNK-ELM gives deterministic results and only requires repetitions if variability due to the training data is considered.
NNK-ELM can also naturally deal with non-standard data. NNK corresponds to an infinite network with error-function sigmoids in the hidden units. If a Gaussian kernel were used instead, the computation would imitate an infinite radial basis function network. Dropping the neural network interpretation, any positive semidefinite matrix can be used. This leaves us with the idea of using a kernel for the nonlinear mapping, then returning to a vectorial representation of the points and applying a classical algorithm (as opposed to the inner-product formulation of algorithms needed for kernel methods). In the case of NNK-ELM, the algorithm is a simple linear regression, but the same idea could be used with arbitrary algorithms. This can serve as a way of applying classical, difficult-to-kernelize algorithms to non-standard data (such as graphs or strings) for which kernels are defined.
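As an illustration of this idea (a minimal sketch of our own, not the paper's implementation: the Gaussian kernel, the toy data and all names below are assumptions), one can compute a kernel matrix, recover an explicit vectorial representation of the points from its eigendecomposition, and then run an ordinary, non-kernelized algorithm on those vectors:

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Z."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Toy data; any positive semidefinite kernel on any data type would do,
# we only use vectors here so that the kernel is easy to write down.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# 1. Kernel matrix on the training points.
K = gaussian_kernel(X, X)

# 2. Return to a vectorial representation: an eigendecomposition of K
#    gives explicit feature vectors Phi with Phi @ Phi.T ~= K.
eigval, eigvec = np.linalg.eigh(K)
keep = eigval > 1e-10                      # drop numerically zero directions
Phi = eigvec[:, keep] * np.sqrt(eigval[keep])

# 3. Apply a classical algorithm to Phi; here ordinary least squares,
#    corresponding to the linear output layer of (NN)K-ELM.
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("training MSE:", np.mean((Phi @ beta - y) ** 2))
```

The sketch only covers the training-time view; mapping new points into the same representation would require an additional step such as a Nyström-type extension.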
4.4 Future Directions
When ELM is considered as an approximation of an infinite network, it becomes obvious that the variance of the hidden layer weights is more important than the weights themselves. It should undergo rigorous model selection, like any other parameter. Also, lessons already learned from other neural network architectures, such as the effect of weight variance on the operating point of the sigmoids, should be kept in mind when determining future directions for ELM development.
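A minimal sketch of what such model selection could look like for a plain ELM, with the standard deviation of the hidden weights treated as a tunable parameter (the tanh activation, the toy data and all settings below are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(1)

def elm_fit_predict(X_tr, y_tr, X_val, n_hidden, sigma, rng):
    """Plain ELM: random hidden weights with standard deviation sigma,
    least-squares (minimum-norm) output layer."""
    d = X_tr.shape[1]
    W = rng.normal(scale=sigma, size=(d, n_hidden))      # hidden weights
    b = rng.normal(scale=sigma, size=n_hidden)           # hidden biases
    H_tr = np.tanh(X_tr @ W + b)
    beta, *_ = np.linalg.lstsq(H_tr, y_tr, rcond=None)   # output weights
    return np.tanh(X_val @ W + b) @ beta

# Toy regression problem.
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

# Treat the weight variance as a model-selection parameter, like any other.
for sigma in [0.01, 0.1, 1.0, 10.0]:
    errs = [np.mean((elm_fit_predict(X_tr, y_tr, X_val, 50, sigma, rng) - y_val) ** 2)
            for _ in range(10)]   # repetitions: ELM results vary with the random weights
    print(f"sigma={sigma:5.2f}  validation MSE={np.mean(errs):.4f}")
```

Large sigma drives the tanh units into saturation and small sigma keeps them in their nearly linear region, so the chosen variance directly sets the operating point of the sigmoids mentioned above.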
Questions about the correct number and behavior of the hidden units in ELM remain open. If the hidden units do not learn anything, what is their meaning in the network? Do they have a role besides increasing the variance of the output?
If we fix the data $x$ and draw the weights $w_i$ (including the bias) randomly and independently, then the hidden layer outputs $a_i = f(w_i x)$ are also independent random variables. They are combined into the model output as $b = \sum_{i=1}^{H} \beta_i a_i$.
The variance of $b$ is related to the number and variance of the $a_i$. This is seen by remembering that, for independent random variables $F$ and $G$, $\operatorname{Var}[F + G] = \operatorname{Var}[F] + \operatorname{Var}[G]$. The more hidden units we use, the larger the variance of the model output.
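To make this explicit, assume for illustration (a simplification not stated above) that the $a_i$ share a common variance $\sigma_a^2$ and are combined with equal, fixed output weights $\beta$. Then

\[
\operatorname{Var}[b] = \operatorname{Var}\Big[\sum_{i=1}^{H} \beta\, a_i\Big] = \sum_{i=1}^{H} \beta^2 \operatorname{Var}[a_i] = H \beta^2 \sigma_a^2 ,
\]

which grows linearly in the number of hidden units $H$ as long as the output weights are held fixed.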
Training of the output layer has the opposite effect on the variance. The output weights are not random, so the variance of the model output $b$ is related to that of the $a_i$ by the rule $\operatorname{Var}[cF] = c^2 \operatorname{Var}[F]$ (where $c$ is a constant). That is, the variance of $b$ is formed as a weighted sum of the variances of the $a_i$, namely $\operatorname{Var}[b] = \sum_{i=1}^{H} \beta_i^2 \operatorname{Var}[a_i]$. The weights $\beta_i$ are chosen to have minimal norm. Although minimizing the norm does not guarantee minimal variance, minimum-norm estimators partially minimize the variance as well (Rao, 1972). Therefore, the choice of output weights tends to cancel the variance-increasing effect of the hidden units.
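This interplay can also be checked numerically. The following sketch (an illustration of our own with arbitrary toy data and a tanh activation, not the paper's setup) estimates the variance of the model output at a fixed test point over repeated draws of the hidden weights, once with fixed output weights and once with minimum-norm least-squares output weights; one would expect the former to grow roughly linearly with $H$ and the latter to stay much smaller, in line with the argument above:

```python
import numpy as np

rng = np.random.default_rng(2)

def output_variance(n_hidden, train_beta, n_repeats=200):
    """Empirical variance of the ELM output b at a fixed test point,
    taken over independent draws of the random hidden weights."""
    d, n_train = 3, 100
    X = rng.normal(size=(n_train, d))
    y = np.sin(X[:, 0])
    x_test = np.ones(d)
    outputs = []
    for _ in range(n_repeats):
        W = rng.normal(size=(d, n_hidden))            # random hidden weights
        A = np.tanh(X @ W)                            # hidden outputs a_i on the training data
        if train_beta:
            beta = np.linalg.pinv(A) @ y              # minimum-norm least-squares solution
        else:
            beta = np.ones(n_hidden)                  # fixed, untrained output weights
        outputs.append(np.tanh(x_test @ W) @ beta)    # b = sum_i beta_i * a_i
    return np.var(outputs)

for H in [10, 50, 200]:
    print(f"H={H:4d}  fixed beta: Var[b]={output_variance(H, False):8.2f}  "
          f"trained beta: Var[b]={output_variance(H, True):.4f}")
```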
We have recognized the importance of variance, yet the roles and interactions of the weight variance, the number of hidden units (which increases the variance), and the determination of the output weights (which decreases it) are not clear, at least to the authors. If we are to understand how and why ELM works, the role of variance needs further study.
REFERENCES
Asuncion, A. and Newman, D. (2007). UCI machine learning repository.

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536.

Cho, Y. and Saul, L. K. (2009). Kernel methods for deep learning. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., and Culotta, A., editors, Proc. of NIPS, volume 22, pages 342–350.

Frénay, B. and Verleysen, M. (2010). Using SVMs with randomised feature spaces: an extreme learning approach. In Proc. of ESANN, pages 315–320.

Golub, G. H. and Van Loan, C. F. (1996). Matrix computations. The Johns Hopkins University Press.

Guyon, I., Gunn, S. R., Ben-Hur, A., and Dror, G. (2004). Result analysis of the NIPS 2003 feature selection challenge. In Proc. of NIPS.

Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. (2006). Extreme learning machine: Theory and applications. Neurocomputing, 70:489–501.

Miche, Y., Sorjamaa, A., Bas, P., Simula, O., Jutten, C., and Lendasse, A. (2010). OP-ELM: Optimally pruned extreme learning machine. IEEE Transactions on Neural Networks, 21(1):158–162.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Proc. of UAI.

Rao, C. R. (1972). Estimation of variance and covariance components in linear models. Journal of the American Statistical Association, 67(337):112–115.

Rao, C. R. and Mitra, S. K. (1972). Generalized Inverse of Matrices and Its Applications. Wiley.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian processes for machine learning. MIT Press.