selection should consider weight variance as a parameter. If the number of data points is reasonably small, NNK-ELM can thus result in considerable time savings when doing model selection. A further advantage is that, unlike ELM, NNK-ELM gives deterministic results and only requires repetitions if variability due to the training data is considered.
NNK-ELM can also naturally deal with non-standard data. NNK corresponds to an infinite network with error-function sigmoids in the hidden units. If a Gaussian kernel were used instead, the computation would imitate an infinite radial basis function network. Dropping the neural network interpretation, any positive semidefinite matrix can be used. This leaves us with the idea of using a kernel for the nonlinear mapping, then returning to a vectorial representation of the points and applying a classical algorithm (as opposed to the inner-product formulation of algorithms needed for kernel methods). In the case of NNK-ELM, the algorithm is a simple linear regression, but the same idea could be used with arbitrary algorithms. This can serve as a way of applying classical, difficult-to-kernelize algorithms to non-standard data (such as graphs or strings) for which kernels are defined.
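As an illustration of this idea (a minimal sketch of our own, not the paper's implementation: the Gaussian kernel, the toy data and all names below are assumptions), one can compute a kernel matrix, recover an explicit vectorial representation of the points from its eigendecomposition, and then run an ordinary, non-kernelized algorithm on those vectors:

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Z."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Toy data; any positive semidefinite kernel on any data type would do,
# we only use vectors here so that the kernel is easy to write down.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# 1. Kernel matrix on the training points.
K = gaussian_kernel(X, X)

# 2. Return to a vectorial representation: an eigendecomposition of K
#    gives explicit feature vectors Phi with Phi @ Phi.T ~= K.
eigval, eigvec = np.linalg.eigh(K)
keep = eigval > 1e-10                      # drop numerically zero directions
Phi = eigvec[:, keep] * np.sqrt(eigval[keep])

# 3. Apply a classical algorithm to Phi; here ordinary least squares,
#    corresponding to the linear output layer of (NN)K-ELM.
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("training MSE:", np.mean((Phi @ beta - y) ** 2))
```

The sketch only covers the training-time view; mapping new points into the same representation would require an additional step such as a Nyström-type extension.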
4.4 Future Directions
When ELM is considered as an approximation of an infinite network, it becomes obvious that the variance of the hidden layer weights is more important than the weights themselves. It should undergo rigorous model selection, like any other parameter. Also, lessons already learned from other neural network architectures, such as the effect of weight variance on the operating point of the sigmoids, should be kept in mind when determining future directions for ELM development.
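A minimal sketch of what such model selection could look like for a plain ELM, with the standard deviation of the hidden weights treated as a tunable parameter (the tanh activation, the toy data and all settings below are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(1)

def elm_fit_predict(X_tr, y_tr, X_val, n_hidden, sigma, rng):
    """Plain ELM: random hidden weights with standard deviation sigma,
    least-squares (minimum-norm) output layer."""
    d = X_tr.shape[1]
    W = rng.normal(scale=sigma, size=(d, n_hidden))      # hidden weights
    b = rng.normal(scale=sigma, size=n_hidden)           # hidden biases
    H_tr = np.tanh(X_tr @ W + b)
    beta, *_ = np.linalg.lstsq(H_tr, y_tr, rcond=None)   # output weights
    return np.tanh(X_val @ W + b) @ beta

# Toy regression problem.
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

# Treat the weight variance as a model-selection parameter, like any other.
for sigma in [0.01, 0.1, 1.0, 10.0]:
    errs = [np.mean((elm_fit_predict(X_tr, y_tr, X_val, 50, sigma, rng) - y_val) ** 2)
            for _ in range(10)]   # repetitions: ELM results vary with the random weights
    print(f"sigma={sigma:5.2f}  validation MSE={np.mean(errs):.4f}")
```

Large sigma drives the tanh units into saturation and small sigma keeps them in their nearly linear region, so the chosen variance directly sets the operating point of the sigmoids mentioned above.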
Questions about the correct number and behavior of the hidden units in ELM remain open. If the hidden units do not learn anything, what is their meaning in the network? Do they have a role besides increasing the variance of the output?
If we fix the data $x$ and draw the weights $w_i$ (including the bias) randomly and independently, then the hidden layer outputs $a_i = f(w_i x)$ are also independent random variables. They are combined into the model output as $b = \sum_{i=1}^{H} \beta_i a_i$.
The variance of $b$ is related to the number and variance of the $a_i$. This is seen by remembering that, for independent random variables $F$ and $G$, $\operatorname{Var}[F + G] = \operatorname{Var}[F] + \operatorname{Var}[G]$. The more hidden units we use, the larger the variance of the model output.
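To make this explicit, assume for illustration (a simplification not stated above) that the $a_i$ share a common variance $\sigma_a^2$ and are combined with equal, fixed output weights $\beta$. Then

\[
\operatorname{Var}[b] = \operatorname{Var}\Big[\sum_{i=1}^{H} \beta\, a_i\Big] = \sum_{i=1}^{H} \beta^2 \operatorname{Var}[a_i] = H \beta^2 \sigma_a^2 ,
\]

which grows linearly in the number of hidden units $H$ as long as the output weights are held fixed.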
Training of the output layer has the opposite effect on the variance. The output weights are not random, so the variance of the model output $b$ is related to that of the $a_i$ by the rule $\operatorname{Var}[cF] = c^2 \operatorname{Var}[F]$ (where $c$ is a constant). That is, the variance of $b$ is formed as a weighted sum of the variances of the $a_i$, namely $\operatorname{Var}[b] = \sum_{i=1}^{H} \beta_i^2 \operatorname{Var}[a_i]$. The weights $\beta_i$ are chosen to have minimal norm. Although minimizing the norm does not guarantee minimal variance, minimum-norm estimators partially minimize the variance as well (Rao, 1972). Therefore, the choice of output weights tends to cancel the variance-increasing effect of the hidden units.
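This interplay can also be checked numerically. The following sketch (an illustration of our own with arbitrary toy data and a tanh activation, not the paper's setup) estimates the variance of the model output at a fixed test point over repeated draws of the hidden weights, once with fixed output weights and once with minimum-norm least-squares output weights; one would expect the former to grow roughly linearly with $H$ and the latter to stay much smaller, in line with the argument above:

```python
import numpy as np

rng = np.random.default_rng(2)

def output_variance(n_hidden, train_beta, n_repeats=200):
    """Empirical variance of the ELM output b at a fixed test point,
    taken over independent draws of the random hidden weights."""
    d, n_train = 3, 100
    X = rng.normal(size=(n_train, d))
    y = np.sin(X[:, 0])
    x_test = np.ones(d)
    outputs = []
    for _ in range(n_repeats):
        W = rng.normal(size=(d, n_hidden))            # random hidden weights
        A = np.tanh(X @ W)                            # hidden outputs a_i on the training data
        if train_beta:
            beta = np.linalg.pinv(A) @ y              # minimum-norm least-squares solution
        else:
            beta = np.ones(n_hidden)                  # fixed, untrained output weights
        outputs.append(np.tanh(x_test @ W) @ beta)    # b = sum_i beta_i * a_i
    return np.var(outputs)

for H in [10, 50, 200]:
    print(f"H={H:4d}  fixed beta: Var[b]={output_variance(H, False):8.2f}  "
          f"trained beta: Var[b]={output_variance(H, True):.4f}")
```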
We have recognized the importance of variance, yet the roles and interactions of the weight variance, the number of hidden units (which increases the variance), and the determination of the output weights (which decreases it) are not clear, at least to the authors. If we are to understand how and why ELM works, the role of variance needs further study.
REFERENCES
Asuncion, A. and Newman, D. (2007). UCI machine learning repository.

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536.

Cho, Y. and Saul, L. K. (2009). Kernel methods for deep learning. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., and Culotta, A., editors, Proc. of NIPS, volume 22, pages 342–350.

Frénay, B. and Verleysen, M. (2010). Using SVMs with randomised feature spaces: an extreme learning approach. In Proc. of ESANN, pages 315–320.

Golub, G. H. and Van Loan, C. F. (1996). Matrix computations. The Johns Hopkins University Press.

Guyon, I., Gunn, S. R., Ben-Hur, A., and Dror, G. (2004). Result analysis of the NIPS 2003 feature selection challenge. In Proc. of NIPS.

Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. (2006). Extreme learning machine: Theory and applications. Neurocomputing, 70:489–501.

Miche, Y., Sorjamaa, A., Bas, P., Simula, O., Jutten, C., and Lendasse, A. (2010). OP-ELM: Optimally pruned extreme learning machine. IEEE Transactions on Neural Networks, 21(1):158–162.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Proc. of UAI.

Rao, C. R. (1972). Estimation of variance and covariance components in linear models. Journal of the American Statistical Association, 67(337):112–115.

Rao, C. R. and Mitra, S. K. (1972). Generalized Inverse of Matrices and Its Applications. Wiley.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian processes for machine learning. MIT Press.