2. The gradient with respect to convolutional parameters tends to be substantially larger than that of fully connected parameters, since it is a sum over all unit environments within the convolutional layer. In other words, convolutional parameters are "reused" for all local environments, which makes their gradients grow, as the toy sketch below illustrates.
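A minimal sketch of this effect in TensorFlow/Keras (the framework used in the experiments); the layer sizes, input shape, and the sum-based toy loss are hypothetical illustrations, not the networks from the study. It compares the mean per-parameter gradient magnitude of a convolutional kernel, which accumulates one contribution per unit environment, with that of a fully connected layer applied to the same random input:

import numpy as np
import tensorflow as tf

# Hypothetical toy setup: random input, one convolutional and one fully
# connected layer, chosen only to illustrate the gradient-magnitude argument.
rng = np.random.default_rng(0)
x = tf.constant(rng.standard_normal((32, 28, 28, 1)), dtype=tf.float32)

conv = tf.keras.layers.Conv2D(filters=4, kernel_size=3, padding="same")
dense = tf.keras.layers.Dense(units=4)

with tf.GradientTape() as tape:
    y_conv = conv(x)                          # the 3x3 kernel is reused at every spatial position
    y_dense = dense(tf.reshape(x, (32, -1)))  # each dense weight is used once per sample
    # sum-based toy loss: every unit environment contributes a term to the kernel gradient
    loss = tf.reduce_sum(y_conv ** 2) + tf.reduce_sum(y_dense ** 2)

g_conv, g_dense = tape.gradient(loss, [conv.kernel, dense.kernel])

print("mean |grad|, convolutional kernel:", float(tf.reduce_mean(tf.abs(g_conv))))
print("mean |grad|, fully connected kernel:", float(tf.reduce_mean(tf.abs(g_dense))))

With such a loss, each convolutional kernel entry receives one gradient contribution per spatial position and per sample, whereas a fully connected weight receives only one per sample, so the printed convolutional gradient magnitude is typically larger by orders of magnitude.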
This suggests a meaningful direction for further work: to define sufficiently general prototypes of networks with convolutional layers and to investigate the performance of alternative optimization methods on them, including the influence of machine precision on the second-order methods.