Representational Capacity of Deep Neural Networks: A Computing Study
Bernhard Bermeitinger¹,² (https://orcid.org/0000-0002-2524-1850), Tomas Hrycej¹ and Siegfried Handschuh¹,²
¹Institute for Computer Science, University of St.Gallen, St.Gallen, Switzerland
²Faculty for Computer Science, University of Passau, Passau, Germany
Keywords: Deep Learning, Artificial Neural Networks, Optimization, Conjugate Gradient.
Abstract:
There is some theoretical evidence that deep neural networks with multiple hidden layers have a potential for
more efficient representation of multidimensional mappings than shallow networks with a single hidden layer.
The question is whether it is possible to exploit this theoretical advantage for finding such representations with
the help of numerical training methods. Tests using prototypical problems with a known mean square minimum
did not confirm this hypothesis. Minima found with the help of deep networks have always been worse than
those found using shallow networks. This does not directly contradict the theoretical findings—it is possible
that the superior representational capacity of deep networks is genuine while finding the mean square minimum
of such deep networks is a substantially harder problem than with shallow ones.
1 INTRODUCTION
At present, there is a strong revival of interest in lay-
ered neural networks. Although the basic structures
and algorithms are similar to those developed in the
eighties of the past century, there are some shifts in
the focus. The most important of them is the empha-
sis on large networks with more than one hidden layer.
The current interest wave is motivated by numer-
ous reports about positive computing experience with
such multi-layer networks for very large mapping
tasks. Typical applications are corpus-based seman-
tics and computer vision (e.g. (Bermeitinger et al.,
2016)). In particular the former is characterized by
a strong under-determination of model parameters—
there are substantially more parameters than labeled
training examples. There are some review works on
deep learning, summarizing both the success stories
and the theoretical justifications for the success. This
paper takes the extensive work by LeCun et al. (LeCun
et al., 2015) as a frequent reference.
The term deep networks will be used in the sense
of deep learning, denoting networks with more than
one hidden layer. Networks with a single hidden layer
will be referred to as shallow networks.
Besides some experimental findings of represen-
tational efficiency of deep neural networks, there
are also several attempts for theoretical justifications.
They state essentially the following: deep networks
exhibit larger representation capacity than shallow
networks. That is, they are capable of approximating
a broader class of functions with the same number of
parameters (Montufar et al., 2014).
To compare the representation capacity of deep
and shallow networks, several interesting results have
been published. Bengio et al. (Bengio et al., 2004)
have investigated a class of algebraic functions that
can be represented by a special structure of deep net-
works (alternating summation and product layers).
They showed that for a certain type of deep networks the number of hidden units necessary for a shallow representation would grow exponentially, while in a deep network it grows only polynomially. Montufar et al. (Montufar et al., 2014) have used a different approach for evaluating the representational capacity. They investigated how many different linear hyperplane regions of input space can be mapped to an output unit. They derived statements for the maximum number of such hyperplanes in deep networks, showing that this number grows exponentially with the number of hidden layers. The activation units used have been rectified linear units and softmax units. It
is common to these findings that they do not make
statements about arbitrary functions. The result of
(Bengio et al., 2004) is valid for algebraic functions
representable by the given deep network architecture,
but not for arbitrary algebraic terms. (Montufar et al., 2014) have derived the maximum number of representable hyperplanes, but it is not guaranteed that this maximum can be attained for an arbitrary function to be represented. In other words, there are function classes that can be efficiently represented by a deep network, while other functions cannot. This is not unexpected: knowing that some N_1-dimensional function is to be identified from N_2 training examples (which is equivalent to satisfying N = N_1 × N_2 equations), it cannot in general be represented by fewer than N parameters, although cases representable by fewer parameters exist. Another familiar analogy is that of algebraic terms. Some of them can be made compact by the distributive law (for example, ab + ac + ad = a(b + c + d)), others cannot.
This finding can be summarized in the following
way. There exist mappings that can be represented
by deep neural networks more economically than by
shallow ones. These mappings are characterized by
multiple usages of intermediary (or hidden) concepts.
This may be typical for cognitive mappings, so that
deep networks may be adequate for cognitive tasks.
Various studies are claiming the superiority of
deep networks based on positive computing expe-
rience. Most of them concern particular architec-
tures such as networks using convolutional layers
(e.g., (Goodfellow et al., 2013)). This network type
has a strong justification for image recognition since
the convolutional layers are closely related to spatial
operators known to be important for image process-
ing. It is then logical to expect that a network with
an appropriate number of such layers is superior to a
shallow network that offers only the possibility of us-
ing a single convolutional layer followed directly by
the final processing to the output.
A few works address the issue of the representational capacity of fully connected deep networks. The authors of (Erhan et al., 2009) are aware of problems with the convergence properties of optimization algorithms for deep networks but claim their superiority over shallow networks on test sets. Their particular focus is to show the usefulness of unsupervised pre-training of some layers rather than the comparison of the representational capacity, so that the choice of test problems makes their study not fully appropriate for clarifying the representational capacity of deep and shallow networks:
- Their study compares networks with different total numbers of network parameters (weights and biases), so that the comparison is favorable for deeper networks having more parameters.
- First-order optimization methods with fixed learning rates are used, so that the danger of influencing the results by a poor convergence of the optimization method is serious.
- Some problems are under-determined, that is, having fewer constraints than parameters. In this case, the minimum of the training set error may be expected to be zero not only for a single optimum solution but also for large subsets of the parameter space.
For an objective review of deep and shallow networks, it is important to:
- use sufficiently determined problems (i.e., having more constraints than parameters),
- use optimization methods that can be expected to reliably converge at least to local minima,
- compare shallow and deep networks with identical or nearly identical numbers of free network parameters.
A comparison following these principles is the goal of this study.
2 ATTAINABLE REPRESENTATIONAL CAPACITY
Even if some class of functions has a superior rep-
resentational capacity, it is not guaranteed that this
capacity can be fully exploited—the additional nec-
essary component is an algorithm that is capable of
attaining the functional fit.
In terms of shallow and deep networks, it would
be necessary to have algorithms that exploit the as-
sumed superior capacity of deep networks in fitting
them to a set of training examples in an efficient way.
This efficiency would have to be sufficient not to lose
the representational advantage. In practical terms, the
usefulness of a network type consists of both a repre-
sentational capacity and the algorithm efficiency. So,
a high capacity potential and a poorly converging al-
gorithm may result in low exploitable capacity.
In the concrete case of mean square minimization,
the numerical efficiency of the optimizing algorithm
for a given problem is the key parameter. For strictly
convex minimization tasks, the usual measure of the
potential efficiency is the condition number of the
Hessian matrix (i.e., the matrix of the second derivatives of the error function with regard to the network parameters W, with W_i being the parameters of the i-th hidden layer). This condition number is defined as the ratio of the largest and the smallest eigenvalue of the Hessian. Unfortunately, this objective measure cannot be applied for our comparison, for two reasons:
- The error function is not convex at points far away from the minimum.
- Even at the points where the error function is convex, some eigenvalues are very close to zero, corresponding to search directions which have no or very small effect on the error function. The condition number is then near to infinity. These directions result from redundancies inherent to the neural networks. This is the case, for example, due to the rotational invariance of the hidden layers at the points of nearly linear activation. Then, the mapping W_{i+1} W_i is identical with W_{i+1} H^{-1} H W_i for any non-singular square matrix H of the dimension corresponding to the width of the i-th hidden layer. So, the neural network constitutes the same mapping with the matrices W_i and W_{i+1} as with H W_i and W_{i+1} H^{-1}.
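This invariance is easy to verify numerically. The following minimal numpy sketch (an illustration added for this discussion, not code from the study; it uses an exactly linear hidden layer, for which the invariance holds exactly) shows that re-parametrizing two consecutive weight matrices with an arbitrary non-singular H leaves the mapping unchanged:

    import numpy as np

    rng = np.random.default_rng(0)

    # consecutive weight matrices around a linear hidden layer of width 4:
    # input (5) -> hidden (4) -> output (3)
    W_i = rng.normal(size=(4, 5))    # parameters of the i-th hidden layer
    W_i1 = rng.normal(size=(3, 4))   # parameters of the following layer
    H = rng.normal(size=(4, 4))      # any non-singular square matrix of the hidden width
    u = rng.normal(size=5)           # an arbitrary input vector

    y_original = W_i1 @ (W_i @ u)
    y_reparametrized = (W_i1 @ np.linalg.inv(H)) @ ((H @ W_i) @ u)

    print(np.allclose(y_original, y_reparametrized))  # True: the same mapping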
For the lack of theoretical alternatives, the only possi-
bility is to assess the efficiency of particular problems
experimentally. To make reliable conclusions from
the experiments, it is important to use reliable opti-
mization methods whose results are as little as pos-
sible subject to random disturbances of the solution
path. This is why a widespread numerical optimiza-
tion procedure with well-defined convergence prop-
erties, the conjugate gradient method, has been used
in addition to the optimization methods usual in the
neural network community.
The problems were generated so that they have a known MSE minimum equal to zero (see Sect. 3), attainable by a particular network architecture (shallow or deep). Since it is hardly possible to generate over-determined problems that have a zero minimum for both a shallow and a deep network, cross-validation has been used. A pair of zero-minimum problems have been generated, one (P_s) with a zero minimum for the shallow network N_s, and another (P_d) with a zero minimum for the deep network N_d. Both network architectures have the same dimensions of input and output vectors, and hidden layer sizes such that the overall numbers of network parameters are close to each other (completely identical parameter numbers are difficult to reach).
Both problems, P_s and P_d, are fitted by shallow network N_s and deep network N_d. Comparing the minima reached allows conjectures about the attainable representational capacity of shallow and deep networks.
The scope of this study is only fully connected
networks. There seems to be no doubt that specific
deep architectures such as those using convolution
networks (mimicking spatial operators in image pro-
cessing) are optimal for specific problems. So, a com-
parison of deep and shallow networks with such spe-
cial architectures would make sense only in such spe-
cific application settings.
3 TEST PROBLEMS WITH A KNOWN MINIMUM
To assess the performance of training algorithms, it
is desirable to use test problems with a known op-
timum. A construction method is presented in this
section. Many statements about the performance of
deep learning are based on computing experience
with practical problems from various application do-
mains. In most cases, the problem is fitting the deep network to real data. This implies that the real minimum is not known—the size of real problems makes it impossible to figure out. This may distort the performance evaluation. Reaching “acceptable” or even “the best known” results from the application point of view leaves unanswered the question of how far we are from the real optimum. It cannot be
excluded that all methods find solutions far away from
the optimum and the statements about the algorithms
are to a large degree arbitrary. To clarify this aspect,
this investigation will make use of problems with a
known minimum. Such problems can be generated in
the following way:
A network with a set of arbitrary (e.g., random) weights w_0 is generated. Furthermore, a set of random input vectors U for a mapping to be fitted is produced. These inputs, applied to the network f(u, w), result in a set of output vectors Y:

y_i = f(u_i, w_0),  i = 1, \ldots, n    (1)
The pairs

(u_i, y_i),  i = 1, \ldots, n    (2)
constitute the training set to be fitted. The least-squares objective function

E = \sum_{i=1}^{n} (y_i - f(u_i, w))^2    (3)
has an obvious minimum of zero. This minimum may not be unique in terms of the parameter vector w. So, the success of the fitting is measured only via a minimum value of E reached. The random weights are generated from a uniform distribution:

w_i \in \left[ -\frac{w_f}{\sqrt{n+1}}, \frac{w_f}{\sqrt{n+1}} \right],  i = 1, \ldots, n    (4)

with n being the number of unit inputs from the preceding layer. There is a predefined factor w_f for controlling the degree of saturation within the network. The division by the square root of n + 1 has the goal of reaching an identical standard deviation of the weighted sum (including the bias) going as input to the nonlinear unit.
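A minimal numpy sketch of this generation scheme is given below (an illustration, not the original experiment code; the Gaussian input distribution, the tanh-shaped hidden activation, and the linear output layer are assumptions made here for concreteness):

    import numpy as np

    rng = np.random.default_rng(42)

    def generate_problem(layer_sizes, n_samples, w_f=1.0):
        """Create a training set (U, Y) with a known zero minimum of E.

        layer_sizes, e.g. [300, 60, 150]: input, hidden, and output widths.
        Weights are drawn uniformly from [-w_f/sqrt(n+1), w_f/sqrt(n+1)],
        with n the number of inputs of the receiving layer, as in (4).
        """
        weights = []
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
            bound = w_f / np.sqrt(n_in + 1)
            # one weight row per input plus a bias row
            weights.append(rng.uniform(-bound, bound, size=(n_in + 1, n_out)))

        def forward(u):
            a = u
            for k, W in enumerate(weights):
                z = np.hstack([a, np.ones((a.shape[0], 1))]) @ W   # weighted sum incl. bias
                a = np.tanh(z) if k < len(weights) - 1 else z      # nonlinear hidden, linear output
            return a

        U = rng.normal(size=(n_samples, layer_sizes[0]))   # random input vectors (assumed Gaussian)
        Y = forward(U)                                     # reference outputs y_i = f(u_i, w_0)
        return U, Y

    # e.g. a problem with a known zero minimum for the shallow architecture B_1
    U, Y = generate_problem([300, 60, 150], n_samples=240)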
Table 1: Overview of test problems.

Problem  Input  Output  Data size  # hidden layers  # nodes (per hidden layer)  # parameters  # constraints
A_1        100      50         80                1                          20          3070           4000
A_3        100      50         80                3                          16          3010           4000
A_5        100      50         80                5                          14          3004           4000
B_1        300     150        240                1                          60         27210          36000
B_3        300     150        240                3                          49         27149          36000
B_5        300     150        240                5                          43         27111          36000
C_1       1000     500        800                1                         200        300700         400000
C_3       1000     500        800                3                         164        300784         400000
C_5       1000     500        800                5                         144        300164         400000
4 OPTIMIZATION METHODS
Neural networks were optimized by several methods implemented in the popular framework Keras (Chollet et al., 2015) with the TensorFlow backend, version 2.0.0-beta0 (Abadi et al., 2015): Stochastic Gradient Descent (SGD) and RMSprop were selected because of their widespread use, as well as Adadelta (Zeiler, 2012).
These methods are first-order and there is a wide-
spread opinion in the neural network community
that second-order methods are not superior to the
first-order ones. However, there are strong theoreti-
cal and empirical arguments in favor of the second-
order methods from numerical mathematics (see,
e.g. (Bermeitinger et al., 2019)). To make sure that
the results in favor of shallow or deep networks are
not biased by deficiencies of the optimization meth-
ods used, second-order methods should not be ne-
glected. So, the Conjugate Gradient method (CG,
see, e.g. (Press et al., 1989)), as implemented in SciPy
(Jones et al., 2001), has also been included. This im-
plementation uses the line search method based on
the step length conditions of Wolfe (Wolfe, 1969). It
exploits the derivative information and has excellent
convergence properties for smooth functions.
Since the Keras-to-SciPy-to-Keras interface requires a custom-built bridge for the information flow between the frameworks, fast GPU-enhanced execution is not possible, and the run-times cannot be compared directly.
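The bridge itself is not published with the paper; the following rough sketch shows how such a Keras-to-SciPy coupling could be built with TensorFlow 2.x eager execution (function names and details are illustrative assumptions, not the authors' implementation):

    import numpy as np
    import tensorflow as tf
    from scipy.optimize import minimize

    def make_scipy_objective(model, x, y):
        """Wrap a Keras model as a flat-parameter objective returning (loss, gradient)."""
        variables = model.trainable_variables
        shapes = [v.shape.as_list() for v in variables]
        sizes = [int(np.prod(s)) for s in shapes]
        y_t = tf.convert_to_tensor(y, dtype=tf.float32)

        def set_flat(flat):
            offset = 0
            for v, shape, size in zip(variables, shapes, sizes):
                v.assign(flat[offset:offset + size].reshape(shape).astype(np.float32))
                offset += size

        def objective(flat):
            set_flat(flat)
            with tf.GradientTape() as tape:
                loss = tf.reduce_sum(tf.square(y_t - model(x, training=True)))  # sum of squared errors
            grads = tape.gradient(loss, variables)
            grad_flat = np.concatenate([g.numpy().ravel() for g in grads]).astype(np.float64)
            return float(loss.numpy()), grad_flat

        return objective

    # usage with SciPy's conjugate gradient (Wolfe line search), assuming model, x_train, y_train exist:
    # fun = make_scipy_objective(model, x_train, y_train)
    # w0 = np.concatenate([v.numpy().ravel() for v in model.trainable_variables]).astype(np.float64)
    # result = minimize(fun, w0, jac=True, method="CG", options={"maxiter": 2000})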
The performance of the optimization methods
has been compared by the number of gradient calls
(epochs in Keras). All these methods have been used
with Keras’ and SciPy’s default settings.
5 COMPUTING RESULTS
A series of computing experiments have been car-
ried out to assess the relationship between attainable
representational capacities of shallow and deep net-
works. The comparison is by using identical mapping
problems (defined by input/output pairs) and observ-
ing the errors of both network architectures. To pro-
vide a reasonable meaning to the mean square figures
attained and to make the results comparable, all prob-
lems have been deliberately defined to have a mini-
mum at zero, according to the scheme of Section 3.
To justify the use of the hidden layer as a feature
extractor, its width should be smaller than the mini-
mum of the input and output sizes. The dimensions
have been chosen so that the full regression is under-
determined (as typical for the application class men-
tioned above), but the relatively narrow hidden layer
makes it slightly over-determined. So the effect of
overfitting, harmful for generalization, is excluded.
5.1 Data Generation
Three problem sizes denoted as A, B, and C have been
used. These classes are characterized by their input
and output dimensions as well as by the size of the
training set. For every class, a shallow network with a
single hidden layer and two deep networks with three
and five hidden layers have been generated. The prob-
lem of size class X ∈ {A, B, C} with i hidden layers is denoted by X_i. The concrete network sizes, parame-
ter numbers, and numbers of constraints are given in
Table 1. The numbers of constraints are imposed by
the reference outputs to be fitted. It is the product of
the output dimension and the training set size. Com-
paring the number of constraints with the number of
parameters defines the extent of over-determination
or under-determination of the problem (e.g., a prob-
lem with more constraints than parameters is over-
determined).
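As a check, the parameter figures of Table 1 follow from counting the weights and biases of the fully connected layers, i.e., (n_in + 1) * n_out per layer. For size class B:

    B_1 (one hidden layer of 60):    (300 + 1) * 60 + (60 + 1) * 150                     = 18060 + 9150        = 27210 parameters
    B_3 (three hidden layers of 49): (300 + 1) * 49 + 2 * (49 + 1) * 49 + (49 + 1) * 150 = 14749 + 4900 + 7500 = 27149 parameters
    constraints for all B problems:  150 outputs * 240 training samples                                        = 36000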
Hidden layer units are symmetric sigmoid functions rescaled to have a unity derivative at x = 0, defined by

s(x) = \frac{1}{1 + e^{-x}}    (5)

and

f(x) = 2 s(2x) - 1 = \frac{2}{1 + e^{-2x}} - 1    (6)
For every individual network architecture, fifteen dif-
ferent random parametrizations with corresponding
training sets have been generated, all with a known
mean square error minimum of zero.
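For illustration, the shallow and deep architectures of size class B can be built in Keras roughly as follows (a sketch that assumes a linear output layer and uses the rescaled sigmoid of (5) and (6) as hidden activation; not the authors' code):

    import tensorflow as tf
    from tensorflow import keras

    def scaled_sigmoid(x):
        # symmetric sigmoid with unit derivative at x = 0, equations (5) and (6);
        # numerically this equals tanh(x)
        return 2.0 / (1.0 + tf.exp(-2.0 * x)) - 1.0

    def build_network(input_dim, output_dim, n_hidden_layers, hidden_width):
        inputs = keras.Input(shape=(input_dim,))
        h = inputs
        for _ in range(n_hidden_layers):
            h = keras.layers.Dense(hidden_width, activation=scaled_sigmoid)(h)
        outputs = keras.layers.Dense(output_dim)(h)   # linear output layer (assumed)
        return keras.Model(inputs, outputs)

    # size class B from Table 1: nearly identical parameter counts
    b1 = build_network(300, 150, n_hidden_layers=1, hidden_width=60)
    b3 = build_network(300, 150, n_hidden_layers=3, hidden_width=49)
    print(b1.count_params(), b3.count_params())   # 27210 and 27149

    # note: Keras' mean_squared_error averages over samples and outputs,
    # while E in equation (3) is a plain sum; the minimizer is the same
    b1.compile(optimizer=keras.optimizers.RMSprop(), loss="mean_squared_error")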
5.2 Optimization Results
The results for the optimization methods for prob-
lem size class B are given in Table 2. The focus
of this table is on showing the performance on shal-
low and deep networks for all optimization methods.
Besides the optimum of the error function F_opt, the error value at the initial random parameter set is shown to illustrate the extent of the error improvement. The number of iterations is fixed to 2000 for the Keras methods, while it is determined by the stopping rule for the conjugate gradient. However, the maximum number of iterations for the conjugate gradient is also set to 2000.
The first three blocks of the table show the opti-
mization results for shallow and deep networks indi-
vidually. Each network is optimized to fit the training
set generated for this network architecture. The error
optimum is known to be zero under the principles
of Section 3. These optima set the baseline for those
reached by the cross-checks in the following rows.
The following four blocks represent the cross-
check itself. The column Network denotes the net-
work architecture used for the applied optimization.
The column Source points to the architecture for
which the training set has been generated, with a
known error optimum of zero. For the training runs
presented in these rows, the optimum is not known. It
is only known that it would be zero for the architec-
ture given in the column Source, but not necessarily
also for the architecture of the column Network, used
for the fitting.
For example, for the first of these sixteen rows, the
network architecture trained is B_1 (i.e., a shallow net with a single hidden layer). It is optimized to fit the training set for which it is known that a zero error can be reached by the architecture B_3 (i.e., a deep net with three hidden layers).
The column Ratio to CG displays how many times
the error function value attained by the Keras methods
was higher than that reached by the conjugate gradi-
ent.
The column Deep/Shallow shows the ratio of the
following error function values for the deep network
and the shallow network.
Additionally, Table 3 shows mean square min-
ima for all size classes using the Keras optimization
method RMSprop. This table elucidates the devel-
opment of the performance (MSE) with shallow and
deep networks for varying network sizes. Each row
shows the performance of a pair of a shallow and a
deep network with a comparable number of param-
eters. The average performance of a shallow net-
work for a problem for which a zero error minimum is
known to be attainable by a deep network is given in
the column Data deep – NN shallow. The average per-
formance of a deep network for a problem for which
a zero error minimum is known to be attainable by a
shallow network is given in the column Data shallow
– NN deep. The ratio of both average performances is
shown in the column Ratio Deep/Shallow.
The following can be observed:
- The mean square error minima (MSE) attained by shallow networks for problems having a zero MSE for some deep network are substantially lower than in the opposite situation.
- The difference tends to decrease slightly with the problem size.
- The by far weakest method was SGD, while the best was the conjugate gradient (CG). Adadelta and RMSprop performed between these two, with RMSprop sometimes approaching the CG performance.
- The difference between the performance with a shallow network on the one hand and a deep network on the other grows with the performance of the optimization method: the difference is relatively small for the worst-performing SGD and very large for the best-performing CG.
Table 2: Detailed results for problems of size B.

Network  Source  Algorithm  # iterations  F_init ×10^3  F_opt ×10^3  Ratio to CG  Deep/Shallow
B_1      B_1     Adadelta           2000         332.2        6.748       578.43
B_1      B_1     RMSprop            2000         332.2        0.098         8.40
B_1      B_1     SGD                2000         332.2       90.402      7748.77
B_1      B_1     CG                  821         332.2        0.012
B_3      B_3     Adadelta           2000         143.3        6.446       173.67
B_3      B_3     RMSprop            2000         143.3        0.243         6.54
B_3      B_3     SGD                2000         143.3       50.485      1360.22
B_3      B_3     CG                 2200         143.3        0.037
B_5      B_5     Adadelta           2000          83.8        4.915        41.04
B_5      B_5     RMSprop            2000          83.8        0.277         2.31
B_5      B_5     SGD                2000          83.8       33.233       277.51
B_5      B_5     CG                 1490          83.8        0.120
B_1      B_3     Adadelta           2000         235.8        2.577        70.41
B_1      B_3     RMSprop            2000         235.8        0.075         2.04
B_1      B_3     SGD                2000         235.8       45.396      1240.02
B_1      B_3     CG                  420         235.8        0.037
B_1      B_5     Adadelta           2000         206.7        1.594        52.44
B_1      B_5     RMSprop            2000         206.7        0.070         2.29
B_1      B_5     SGD                2000         206.7       32.404      1065.83
B_1      B_5     CG                  333         206.7        0.030
B_3      B_1     Adadelta           2000         237.8       28.331         6.75          11.0
B_3      B_1     RMSprop            2000         237.8        4.415         1.05          59.2
B_3      B_1     SGD                2000         237.8      118.896        28.34           2.6
B_3      B_1     CG                 1072         237.8        4.195                       114.6
B_5      B_1     Adadelta           2000         208.6       44.980         5.32          28.2
B_5      B_1     RMSprop            2000         208.6        9.629         1.14         138.1
B_5      B_1     SGD                2000         208.6      136.118        16.11           4.2
B_5      B_1     CG                 2125         208.6        8.451                       278.0

Table 3: Results for all given problems and their ratio between shallow and deep networks.

Shallow setting  Deep setting  Data deep – NN shallow (×10^3)  Data shallow – NN deep (×10^3)  Ratio Deep/Shallow
A_1              A_3                                    0.038                           5.368               140.5
A_1              A_5                                    0.035                          11.353               320.5
B_1              B_3                                    0.075                           4.415                59.2
B_1              B_5                                    0.070                           9.629               138.1
C_1              C_3                                    0.182                           4.663                25.7
C_1              C_5                                    0.148                           9.777                66.1

6 DISCUSSION

The computing experiments seem to essentially show the superiority of shallow networks in attaining low mean square minima for given mapping problems. This is not necessarily a contradiction to the theoretical results expecting the contrary. It is still possible that the representational capacity of deep networks is superior, while it is difficult to exploit this capacity
by fitting the mapping with the help of numerical al-
gorithms.
Shallow networks have been superior for all test
problems and all optimizing algorithms. However, it
is interesting to observe that the gap, although always
large, was relatively smaller for weakly performing
optimization methods (SGD and Adadelta) as well as
for large networks.
A possible hypothesis explaining both observations might
be that the gap is low if the optimizing method fails
to search for the minimum efficiently, approaching the
performance of some kind of random search. This can
result either from the weakness of the method itself or
from the difficulty of the problem. Even sophisticated
methods such as the conjugate gradient have growing
difficulties with growing problem size. These difficul-
ties may have to do with the machine precision nec-
essary for stopping rules (testing for zero gradient) or
with the number of iterations available.
So, the conjugate gradient provides a theoreti-
cal guarantee for finding a minimum for an exactly
quadratic problem of dimension q in q steps. This
is a huge number of iterations for our test problems
(and other real-world ones). In addition to this, our
problems are far from being exactly quadratic (they
may even be non-convex), which further increases
the computing requirements. This makes clear that
the adequacy of every optimization method decreases
with the problem size. This still does not explain why
deep networks should be more favorable if the opti-
mization method is not adequate to the problem—at
best, it may be argued that the search is then close to
the random search, which might be indifferent to the
functional parametrization optimized.
REFERENCES
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,
Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,
M., Ghemawat, S., Goodfellow, I., Harp, A., Irving,
G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kud-
lur, M., Levenberg, J., Mané, D., Monga, R., Moore,
S., Murray, D., Olah, C., Schuster, M., Shlens, J.,
Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Van-
houcke, V., Vasudevan, V., Viégas, F., Vinyals, O.,
Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and
Zheng, X. (2015). TensorFlow: Large-Scale Machine
Learning on Heterogeneous Systems. Software avail-
able from https://tensorflow.org.
Bengio, Y., Paiement, J.-f., Vincent, P., Delalleau, O., Roux,
N. L., and Ouimet, M. (2004). Out-of-sample extensions
for LLE, Isomap, MDS, eigenmaps, and spectral
clustering. In Advances in Neural Information Pro-
cessing Systems, pages 177–184.
Bermeitinger, B., Freitas, A., Donig, S., and Handschuh, S.
(2016). Object Classification in images of Neoclas-
sical Furniture using Deep Learning. In Bozic, B.,
Mendel-Gleason, G., Debruyne, C., and O’Sullivan,
D., editors, Computational History and Data-Driven
Humanities, pages 109–112, Cham. Springer Interna-
tional Publishing.
Bermeitinger, B., Hrycej, T., and Handschuh, S. (2019).
Singular Value Decomposition and Neural Networks.
In Artificial Neural Networks and Machine Learning
– ICANN 2019, Cham. Springer International Publish-
ing. In Press.
Chollet, F. et al. (2015). Keras. Software available from
https://keras.io.
Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., and
Vincent, P. (2009). The difficulty of training deep ar-
chitectures and the effect of unsupervised pre-training.
In Artificial Intelligence and Statistics, pages 153–
160.
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and
Shet, V. (2013). Multi-digit number recognition from
street view imagery using deep convolutional neural
networks. arXiv preprint arXiv:1312.6082.
Jones, E., Oliphant, T., Peterson, P., and others (2001).
SciPy: Open source scientific tools for Python.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learn-
ing. Nature, 521(7553):436–444.
Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y.
(2014). On the Number of Linear Regions of Deep
Neural Networks. In Ghahramani, Z., Welling, M.,
Cortes, C., Lawrence, N. D., and Weinberger, K. Q.,
editors, Advances in Neural Information Processing
Systems 27, pages 2924–2932. Curran Associates, Inc.
Press, W. H., Flannery, B. P., Teukolsky, S. A., Vetterling,
W. T., et al. (1989). Numerical Recipes, volume 2.
Cambridge University Press, Cambridge.
Wolfe, P. (1969). Convergence conditions for ascent meth-
ods. SIAM review, 11(2):226–235.
Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning
Rate Method. arXiv:1212.5701 [cs].