discontinuities at the iterations 20, 40, 60, etc. This
can be explained by the fact that the error function
changes at these points. The error function of a DNN
ultimately depends on the presented sets of training
samples. While this is not a new insight, the views
that are provided by the enhanced training error curve
can help to study these effects in greater detail. For
example, we can now examine how varying the
mini-batch size influences the shape of the functions
ϕ_k with respect to their steepness, smoothness, etc.
We can look at the same experiment at the level of
line searches, where interesting new features are the
number of loose ends, the average number of line
search iterations needed, etc. Another experiment is
to vary K, the number of iterations per CG descent
per mini-batch, or even some ratio of K and the
mini-batch size. In any scenario, our approach does
not confront us with unfamiliar views. Instead, we
obtain an extension of a widespread monitoring tool
whose novel features can be accessed as needed.
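To make this concrete, the following sketch illustrates how the quantities behind such a study could be logged. It is a minimal illustration, not the implementation used in our experiments: a toy least-squares model stands in for the DNN, steepest descent with a backtracking line search stands in for the CG descent and its line search, and a line search that exhausts its iteration budget is merely a stand-in for the loose ends discussed above. All names are placeholders.

import numpy as np

def make_phi(X, y, batch_idx):
    """The error function phi_k induced by one mini-batch."""
    Xb, yb = X[batch_idx], y[batch_idx]
    return lambda w: 0.5 * np.mean((Xb @ w - yb) ** 2)

def backtracking_line_search(phi, w, d, grad, max_iter=20):
    """Shrink the step size until sufficient decrease is reached;
    report how many iterations were needed and whether the budget ran out."""
    t = 1.0
    for i in range(max_iter):
        if phi(w + t * d) <= phi(w) + 1e-4 * t * (grad @ d):
            return t, i + 1, False          # accepted step
        t *= 0.5
    return t, max_iter, True                # budget exhausted: counted as a "loose end" here

def run_experiment(X, y, batch_size, K, num_batches, seed=0):
    """Log, per mini-batch, the error after each of the K descent steps
    together with the line search statistics."""
    rng = np.random.default_rng(seed)
    w, log = np.zeros(X.shape[1]), []
    for k in range(num_batches):
        batch_idx = rng.choice(len(X), size=batch_size, replace=False)
        phi = make_phi(X, y, batch_idx)     # phi_k changes with every mini-batch
        record = {"batch": k, "steps": []}
        for _ in range(K):                  # K descent iterations per mini-batch
            grad = X[batch_idx].T @ (X[batch_idx] @ w - y[batch_idx]) / batch_size
            d = -grad                       # steepest descent stands in for CG
            t, ls_iters, loose_end = backtracking_line_search(phi, w, d, grad)
            w = w + t * d
            record["steps"].append({"error": phi(w),
                                    "line_search_iters": ls_iters,
                                    "loose_end": loose_end})
        log.append(record)
    return log

# Example usage: compare two mini-batch sizes for the same K.
X, y = np.random.randn(1000, 20), np.random.randn(1000)
for bs in (50, 200):
    log = run_experiment(X, y, batch_size=bs, K=3, num_batches=20)
    loose_ends = sum(s["loose_end"] for r in log for s in r["steps"])
    print(f"batch size {bs}: {loose_ends} loose ends")

Logs of this kind are exactly what the three levels of the enhanced training error curve draw on: the per-batch records feed the overview, the per-step errors feed the zoomed view, and the line search statistics feed the details-on-demand view.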
4 CONCLUSION
In this paper, we presented a novel, enhanced type
of training error curve that consists of three levels
of detail. We showed how these levels of detail can
be organized as proposed by the Visual Information
Seeking Mantra (Shneiderman, 1996): The overview
stage covers the details of a traditional training error
versus training iteration view. At the same time, it is
marked by new graphical elements that can be easily
accessed through zooming and filtering. These new
elements reflect how the training error varies as we
move from solution to solution in a gradient descent
iteration. The details-on-demand view allows for an
exploration of the methods that determine how far to
move along a given descent direction.
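To illustrate how a tool could organize the logged quantities across these three levels of detail, the following sketch outlines one possible data layout; all class and field names are hypothetical and not part of an existing implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class LineSearchTrace:            # details-on-demand: one line search
    step_sizes: List[float]       # candidate step sizes that were probed
    errors: List[float]           # error values observed at those step sizes
    accepted: bool                # False would mark a loose end

@dataclass
class DescentStep:                # zoom and filter: one move from solution to solution
    error_after: float            # training error after this descent step
    line_search: LineSearchTrace  # details of how far we moved along the direction

@dataclass
class TrainingIteration:          # overview: one point of the classic error curve
    iteration: int
    training_error: float
    steps: List[DescentStep] = field(default_factory=list)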
We were able to give an example that covered all
levels of detail accessible through the novel training
error curve. Guided by Shneiderman’s Mantra, it was
possible to introduce our visualization approach in a
way that can be seen as an interactive workflow. The
design and development of an interactive tool for the
visual exploration of the novel training error curve is
one of our future goals. To this end, we identified a
list of other gradient descent methods that we aim to
include in our tool (see Table 1).
As a final step, we gave a first impression of a hyperparameter
analysis that can be realized with our
visualization approach. In doing so, we hinted at
further possibilities for gaining new insights into the
mini-batch training of DNNs. While the concrete
experimental setup did not reveal entirely new
insights about mini-batch training approaches, we
pointed out that our way of visualizing gradient
descent-related quantities is especially useful because
users can access lower-level details as needed through
a monitoring tool that feels familiar.
REFERENCES
Becker, M., Lippel, J., and Stuhlsatz, A. (2017). Regularized nonlinear discriminant analysis - an approach to robust dimensionality reduction for data visualization. In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: IVAPP, (VISIGRAPP 2017), pages 116–127. INSTICC, SciTePress.
Dozat, T. (2016). Incorporating Nesterov momentum into Adam. ICLR Workshop Submission.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159.
Hohman, F., Kahng, M., Pienta, R., and Chau, D. H. (2018). Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer Graphics.
Igel, C. and Hüsken, M. (2003). Empirical evaluation of the improved Rprop learning algorithms. Neurocomputing, 50:105–123.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
Polak, E. (2012). Optimization: Algorithms and Consistent Approximations. Applied Mathematical Sciences. Springer New York.
Polak, E. and Ribière, G. (1969). Note sur la convergence de méthodes de directions conjuguées. ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique, 3(R1):35–43.
Reddi, S. J., Kale, S., and Kumar, S. (2018). On the convergence of Adam and beyond. In International Conference on Learning Representations.
Riedmiller, M. (1994). Advanced supervised learning in multi-layer perceptrons from backpropagation to adaptive learning algorithms. Computer Standards & Interfaces, 16(3):265–278.
Rosenbrock, H. H. (1960). An automatic method for finding the greatest or least value of a function. The Computer Journal, 3(3):175–184.
Shneiderman, B. (1996). The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the 1996 IEEE Symposium on Visual Languages, VL '96, pages 336–, Washington, DC, USA. IEEE Computer Society.
Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701.