These difficulties may have to do with the machine precision required by the stopping rules (testing for a zero gradient) or with the number of iterations available.
In summary, the conjugate gradient method provides a theoretical guarantee of finding the minimum of an exactly quadratic problem of dimension q within q steps. For our test problems (and for real-world ones), this is a huge number of iterations. Moreover, our problems are far from exactly quadratic (they may even be non-convex), which further increases the computational requirements. It is therefore clear that the adequacy of any optimization method decreases with growing problem size. This still does not explain why deep networks should be more favorable when the optimization method is not adequate to the problem; at best, it can be argued that the search then approaches a random search, which may be indifferent to the parametrization of the function being optimized.
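To make the q-step property and the role of the zero-gradient stopping rule concrete, the following is a minimal NumPy sketch of the linear conjugate gradient method on an exactly quadratic problem of dimension q. The function name, the tolerance, and the random test system are illustrative assumptions, not taken from the paper; in exact arithmetic the loop terminates within q steps, while in floating point it is the tolerance test that decides when the gradient is treated as zero.

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    # Minimize the quadratic f(x) = 0.5 x^T A x - b^T x for symmetric
    # positive definite A. Returns the minimizer and the iteration count.
    x = x0.copy()
    r = b - A @ x              # residual = negative gradient of f
    d = r.copy()               # initial search direction
    for k in range(len(b)):
        if np.linalg.norm(r) < tol:       # stopping rule: near-zero gradient
            return x, k
        alpha = (r @ r) / (d @ A @ d)     # exact line search on the quadratic
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        beta = (r_new @ r_new) / (r @ r)  # conjugacy-preserving update
        d = r_new + beta * d
        r = r_new
    return x, len(b)                      # at most q steps in exact arithmetic

# Illustrative random SPD system of dimension q (hypothetical test data).
q = 50
M = np.random.randn(q, q)
A = M @ M.T + q * np.eye(q)   # symmetric positive definite
b = np.random.randn(q)
x, iters = conjugate_gradient(A, b, np.zeros(q))
print(iters, np.linalg.norm(A @ x - b))

With the diagonal shift making A well conditioned, the tolerance is typically met in far fewer than q iterations; for ill-conditioned or non-quadratic objectives the iteration count approaches, or exceeds, the dimension q, which is the cost discussed above.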