
posed in this paper, demonstrating the potential for
improved accuracy through the use of diverse sam-
pling strategies.
Limitations. As the algorithm relies on past information to drive non-uniform sampling of the data, it requires seeing the same data multiple times. This creates a bottleneck for architectures that rely on progressive data streaming; designing importance sampling algorithms suited to streaming settings is a promising direction for future research. Non-uniform sampling can also slow down runtime execution: the samples selected for a mini-batch are not laid out contiguously in memory, which slows data loading. We believe a careful implementation can mitigate this issue.
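To make the data-loading concern concrete, the following is a minimal sketch (not the paper's implementation) of how an importance-sampled mini-batch gathers non-contiguous rows, and how sorting the drawn indices before the gather keeps memory access closer to sequential. The function name sample_batch and the uniform stand-in weights are illustrative assumptions.

import torch

def sample_batch(data, weights, batch_size):
    # Draw indices in proportion to the per-sample importance weights.
    idx = torch.multinomial(weights, batch_size, replacement=True)
    # Sorting the indices makes the subsequent gather closer to a sequential read,
    # partially mitigating the cost of non-contiguous access.
    idx, _ = torch.sort(idx)
    return data[idx], idx

# Usage: 10k samples of dimension 32, weights standing in for accumulated importance scores.
data = torch.randn(10_000, 32)
weights = torch.rand(10_000)
batch, idx = sample_batch(data, weights, batch_size=128)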
8 CONCLUSION
In conclusion, our work introduces an efficient sampling strategy for machine learning optimization, which can be used both for importance sampling and for data pruning. The strategy, which relies on the gradient of the loss and has minimal computational overhead, was tested across a variety of classification and regression tasks with promising results. Our work demonstrates that by paying more attention to samples carrying critical training information, we can speed up convergence without adding complexity. We hope our findings will encourage further research into simpler and more effective sampling strategies for machine learning.
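As an illustration of a loss-gradient-based importance score with negligible overhead, the following sketch shows one common instantiation for a softmax classifier, where the norm of the loss gradient with respect to the logits is available from the forward pass alone. This is an assumed example for exposition, not necessarily the exact scheme proposed in the paper; the function name importance_scores is hypothetical.

import torch
import torch.nn.functional as F

def importance_scores(logits, targets):
    # For cross-entropy, d(loss)/d(logits) = softmax(logits) - one_hot(targets),
    # so its per-sample norm can serve as a cheap importance score.
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    return (probs - one_hot).norm(dim=1)

# Usage: scores like these can drive non-uniform sampling of later mini-batches.
logits = torch.randn(8, 10)                 # batch of 8 samples, 10 classes
targets = torch.randint(0, 10, (8,))
print(importance_scores(logits, targets))   # one score per sample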