
REFERENCES
Alain, G., Lamb, A., Sankar, C., Courville, A., and Bengio, Y. (2015). Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481.
Alfarra, M., Hanzely, S., Albasyoni, A., Ghanem, B., and Richtárik, P. (2021). Adaptive learning of the optimal batch size of SGD.
Balles, L., Romero, J., and Hennig, P. (2017). Coupling adaptive batch sizes with learning rates. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), page ID 141.
Bordes, A., Ertekin, S., Weston, J., and Bottou, L. (2005). Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6(54):1579–1619.
Dong, C., Jin, X., Gao, W., Wang, Y., Zhang, H., Wu, X., Yang, J., and Liu, X. (2021). One backward from ten forward, subsampling for large-scale deep learning. arXiv preprint arXiv:2104.13114.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Faghri, F., Duvenaud, D., Fleet, D. J., and Ba, J. (2020). A study of gradient variance in deep learning. arXiv preprint arXiv:2007.04532.
Grittmann, P., Georgiev, I., Slusallek, P., and Křivánek, J. (2019). Variance-aware multiple importance sampling. ACM Trans. Graph., 38(6).
Kahn, H. (1950). Random sampling (Monte Carlo) techniques in neutron attenuation problems–I. Nucleonics, 6(5):27, passim.
Kahn, H. and Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278.
Katharopoulos, A. and Fleuret, F. (2017). Biased importance sampling for deep neural network training. arXiv preprint arXiv:1706.00043.
Katharopoulos, A. and Fleuret, F. (2018). Not all samples are created equal: Deep learning with importance sampling. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2525–2534. PMLR.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kondapaneni, I., Vévoda, P., Grittmann, P., Skřivan, T., Slusallek, P., and Křivánek, J. (2019). Optimal multiple importance sampling. ACM Transactions on Graphics (TOG), 38(4):37.
Loshchilov, I. and Hutter, F. (2015). Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343.
Needell, D., Ward, R., and Srebro, N. (2014). Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K., editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.
Owen, A. and Zhou, Y. (2000). Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135–143.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2017). PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660.
Ren, H., Zhao, S., and Ermon, S. (2019). Adaptive antithetic sampling for variance reduction. In International Conference on Machine Learning, pages 5420–5428. PMLR.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.
Santiago, C., Barata, C., Sasdelli, M., Carneiro, G., and Nascimento, J. C. (2021). LOW: Training deep neural networks by learning optimal sample weights. Pattern Recognition, 110:107585.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.
Veach, E. (1997). Robust Monte Carlo methods for light transport simulation. PhD thesis, Stanford University.
Wang, L., Yang, Y., Min, R., and Chakradhar, S. (2017). Accelerating deep neural network training with inconsistent stochastic gradient descent. Neural Networks, 93:219–229.
Zhang, C., Öztireli, C., Mandt, S., and Salvi, G. (2019). Active mini-batch sampling using repulsive point processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5741–5748.
Zhang, M., Dong, C., Fu, J., Zhou, T., Liang, J., Liu, J., Liu, B., Momma, M., Wang, B., Gao, Y., et al. (2023). AdaSelection: Accelerating deep learning training through data subsampling. arXiv preprint arXiv:2306.10728.
Zhao, P. and Zhang, T. (2015). Stochastic optimization with importance sampling for regularized loss minimization. In Bach, F. and Blei, D., editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1–9, Lille, France. PMLR.