shows the loss trajectory when the weight update is performed in single precision and only the gradient computations use the Bfloat number format. In this experiment, the stochastic bound is numerically equal to single-precision SGD, confirming what was already observed in Figure 8.
Figure 12: Logistic regression trained using single-precision SGD and a fixed learning rate.
Figure 13: Logistic regression trained using Bfloat gradients with an accumulator size of 50, and a single-precision weight update ($s_k = 0$ and $r_k \neq 0$).
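For concreteness, the following sketch (our illustration, not the authors' code) reproduces this mixed-precision setting in PyTorch: the logistic-regression gradient is computed in bfloat16 while the weight update remains in single precision. The data shapes, batch size, and learning rate are illustrative assumptions.

```python
import torch

# Minimal sketch of the setting in Figure 13: the logistic-regression
# gradient is computed in bfloat16 (r_k != 0) while the weight update
# itself stays in single precision (s_k = 0). Shapes, batch size, and
# learning rate are illustrative assumptions, not the paper's settings.
torch.manual_seed(0)
X = torch.randn(1024, 20)                    # features, float32
y = (torch.rand(1024) < 0.5).float()         # binary labels
w = torch.zeros(20)                          # float32 weights
lr = 0.1

for step in range(200):
    idx = torch.randint(0, X.shape[0], (32,))   # mini-batch indices
    xb = X[idx].bfloat16()                      # inputs cast to bfloat16
    yb = y[idx].bfloat16()
    wb = w.bfloat16()                           # low-precision copy of the weights
    p = torch.sigmoid(xb @ wb)                  # forward pass in bfloat16
    g = xb.t() @ (p - yb) / xb.shape[0]         # gradient computed in bfloat16
    w -= lr * g.float()                         # weight update kept in float32
```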
6 CONCLUSION
We have studied the convergence of low-precision floating-point SGD for quasi-convex loss functions and extended existing deterministic and stochastic bounds for convex loss functions. Our theoretical setup accounts for numerical errors in both the weight update and the gradient computations, and we derived the optimal step size as a corollary of our theoretical results. Our experiments demonstrate the effect of these numerical errors on the weight update and gradient computations, and show that the accumulator mantissa size plays a key role in reducing the numerical error and improving the convergence of SGD. Although our experiments with logistic regression are promising, extending them to more complex models is an appealing direction for future work.
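To make the role of the accumulator mantissa concrete, the following NumPy sketch (a simplified rounding model of our own, not the paper's accumulator implementation) sums the same sequence of small gradient-like terms while rounding every partial sum to a short and to a long mantissa, and reports the deviation from the exact sum.

```python
import numpy as np

def round_mantissa(x, bits):
    """Round x to roughly `bits` mantissa bits (illustrative rounding model)."""
    if x == 0.0:
        return 0.0
    e = np.floor(np.log2(abs(x)))        # exponent of x
    ulp = 2.0 ** (e - bits)              # spacing of representable values
    return np.round(x / ulp) * ulp

rng = np.random.default_rng(0)
terms = rng.standard_normal(10_000) * 1e-3   # small gradient-like terms
exact = terms.sum()                          # reference sum in float64

for bits in (7, 23):                         # bfloat16-like vs float32-like mantissa
    acc = 0.0
    for t in terms:
        acc = round_mantissa(acc + t, bits)  # every partial sum is rounded
    print(f"{bits}-bit accumulator, |error| = {abs(acc - exact):.2e}")
```

A longer accumulator mantissa keeps more of each small increment, which is consistent with the observation above that accumulator size drives the numerical error of low-precision SGD.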