Theorem 1. The classic clipped gradient descent converges with high probability to $\theta(\infty)$ such that
$$\mathbb{E}_{(x,y)\sim\mu}\left[-\nabla_{\theta}\,\ell(\theta(\infty),x,y)\;c(\theta(\infty),x,y)\right] = 0, \qquad (22)$$
provided assumptions (A1) and (A2) are satisfied.
If we additionally assume (A3), then the adaptive
clipped gradient descent also converges with high
probability to θ(∞) satisfying (22).
Proof. In Lemma 2, we showed that $H$ is a Marchaud map. We also showed that the martingale noise vanishes asymptotically with high probability; see Lemma 3. Coupling these lemmas with the stated assumptions, we may apply the theory developed in (Benaïm et al., 2005), which lets us conclude the following. When the clipped gradient descent is numerically stable, which we show happens with high probability, its limit coincides with the limit, taken as $t \to \infty$, of an associated solution to the differential inclusion $\dot{\theta}(t) \in -H(\theta(t))$. Further, this limit is an equilibrium of $H$; that is, $-H(\theta(\infty)) = 0$, given that $\theta(\infty)$ is the limit. Since the clipped gradient schemes, classic and adaptive, are stable with high probability, they converge with high probability to $\theta(\infty)$ satisfying (22).
Colloquially speaking, the clipped gradient descent converges, with high probability, to the set of NN weights with the property that, on average, the clipped loss-gradient is zero. Since the clipping factor is always positive, we may conclude that the loss-gradient itself is zero on average. By "average", we mean an expectation taken with respect to the data distribution $\mu$. When the loss-gradient is zero, we can conclude that, in most circumstances, the algorithm has converged to a local minimum of the training loss function.
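To make this concrete, the following is a minimal sketch of classic norm-based clipped gradient descent on a toy least-squares problem. The loss, data, threshold `tau`, and step size are illustrative assumptions, not the paper's setup. Since the clipping factor $c = \min(1, \tau/\|g\|)$ is strictly positive, a vanishing average clipped gradient forces the average loss-gradient itself to vanish, as in (22).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: per-sample loss l(theta, x, y) = 0.5 * (theta . x - y)^2.
X = rng.normal(size=(256, 5))
true_theta = rng.normal(size=5)
Y = X @ true_theta

def clipped_grad(theta, x, y, tau=1.0):
    """Loss gradient scaled by the norm-based clipping factor
    c = min(1, tau / ||g||), which is always positive."""
    g = (theta @ x - y) * x                          # per-sample gradient
    c = min(1.0, tau / (np.linalg.norm(g) + 1e-12))  # clipping factor
    return c * g

theta = np.zeros(5)
lr = 0.05
for _ in range(2000):
    i = rng.integers(len(X))
    theta -= lr * clipped_grad(theta, X[i], Y[i])

# At convergence, the average clipped gradient over the data is ~0;
# because c > 0, the average loss gradient is ~0 as well (cf. (22)).
avg = np.mean([clipped_grad(theta, x, y) for x, y in zip(X, Y)], axis=0)
print(np.linalg.norm(avg))
```

On this convex example the iterates settle where the expected clipped gradient vanishes, which here is the global (hence also local) minimizer of the average loss.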
5 CONCLUSIONS
In this paper, we considered the method of gradient
clipping used to control the exploding gradients prob-
lem in deep learning. We discussed the classic and
adaptive variants of norm-based clipping. For the
classic clipping method, we showed that it is numeri-
cally stable with high probability. It further converges
to a local minimum of the average loss function. In the case of adaptive clipping, we observed that the update is on the order of the NN weights. Owing to this linear growth of the update as a function of the NN weights, stability cannot be guaranteed without additional assumptions. However, we observed that we may draw on the theory of linear dynamical systems in order to ensure stability of the adaptive clipping scheme. Once numerical stability is guaranteed, the adaptive clipping method converges, as before, to a local minimizer of the average training loss. It must be noted that the averaging in both variants is with respect to the distribution that generated the training data.
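The observation that adaptive updates grow linearly with the weights can be illustrated by a threshold that scales with the parameter norm (cf. the adaptivity discussed by Zhang et al., 2019). The exact rule below, the constant `alpha`, and the toy problem are assumptions for illustration, not the paper's definition.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same toy least-squares setup as before; all constants are illustrative.
X = rng.normal(size=(256, 5))
true_theta = rng.normal(size=5)
Y = X @ true_theta

def adaptive_clipped_grad(theta, x, y, alpha=0.5):
    """Adaptive norm clipping with a weight-dependent threshold
    tau = alpha * (||theta|| + 1): the clipped update can therefore
    grow linearly with the NN weights (hypothetical rule)."""
    g = (theta @ x - y) * x
    tau = alpha * (np.linalg.norm(theta) + 1.0)
    c = min(1.0, tau / (np.linalg.norm(g) + 1e-12))
    return c * g

theta = rng.normal(size=5)
lr = 0.05
for _ in range(3000):
    i = rng.integers(len(X))
    theta -= lr * adaptive_clipped_grad(theta, X[i], Y[i])
```

Because the update magnitude scales with $\|\theta\|$, boundedness of the iterates is no longer automatic; on this well-conditioned convex problem the scheme nonetheless remains stable and reaches the minimizer of the average loss.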
REFERENCES
Aubin, J.-P. and Cellina, A. (2012). Differential inclu-
sions: set-valued maps and viability theory, volume
264. Springer Science & Business Media.
Benaïm, M., Hofbauer, J., and Sorin, S. (2005). Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348.
Bercu, B., Delyon, B., and Rio, E. (2015). Concentration
inequalities for sums and martingales. Springer.
Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern recog-
nition and machine learning, volume 4. Springer.
Borkar, V. S. (2009). Stochastic approximation: a dynami-
cal systems viewpoint, volume 48. Springer.
Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Durrett, R. (2018). Stochastic calculus: a practical intro-
duction. CRC press.
Durrett, R. (2019). Probability: theory and examples, volume 49. Cambridge University Press.
Halmos, P. R. (2017). Finite-dimensional vector spaces.
Courier Dover Publications.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
Ramaswamy, A. and Bhatnagar, S. (2017). A generalization of the Borkar-Meyn theorem for stochastic recursive inclusions. Mathematics of Operations Research, 42(3):648–661.
Ramaswamy, A. and Bhatnagar, S. (2018). Stability of stochastic approximations with "controlled Markov" noise and temporal difference learning. IEEE Transactions on Automatic Control, 64(6):2614–2620.
Rehmer, A. and Kroll, A. (2020). On the vanishing and
exploding gradient problem in gated recurrent units.
IFAC-PapersOnLine, 53(2):1243–1248.
Rudin, W. et al. (1976). Principles of mathematical analysis, volume 3. McGraw-Hill, New York.
Schmidhuber, J. (2015). Deep learning in neural networks:
An overview. Neural networks, 61:85–117.
Vu, T. and Raich, R. (2022). On asymptotic linear con-
vergence of projected gradient descent for constrained
least squares. IEEE Transactions on Signal Process-
ing, 70:4061–4076.
Zhang, J., He, T., Sra, S., and Jadbabaie, A. (2019). Why
gradient clipping accelerates training: A theoretical
justification for adaptivity. In International Confer-
ence on Learning Representations.
ICPRAM 2023 - 12th International Conference on Pattern Recognition Applications and Methods