[Figure 2: Progress of ASSET_F and ASSET*_F to their completion (MNIST-E), in terms of test error rate. Three panels plot test error against time (h) for approximation dimensions d = 1024, d = 4096, and d = 16384.]
when λ approaches zero for linear SVMs.
4.3 Large-scale Performance
We take the final data set MNIST-E and compare the performance of ASSET_F and ASSET*_F to the online SVM code LASVM. For a fair comparison, we fed the training samples to the algorithms in the same order.
Figure 2 shows the progress of a single run of our algorithms, with various approximation dimensions d (equal to s in this case) in the range [1024, 16384]. Vertical bars in the graphs indicate the completion of training. ASSET_F tends to converge faster and attains smaller test error than ASSET*_F, despite the theoretically slower convergence rate of the former. With d = 16384, ASSET_F and ASSET*_F required 7.2 hours to finish, reaching test error rates of 2.7% and 3.5%, respectively. LASVM produced a better solution with only 0.2% test error, but it required 4.3 days of computation to complete a single pass through the same data.
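As a rough illustration of the kind of pipeline this experiment exercises, the sketch below trains a linear SVM by stochastic subgradient descent on a d-dimensional random Fourier feature map, in the spirit of Rahimi and Recht (2008) and Pegasos (Shalev-Shwartz et al., 2011). It is not the ASSET implementation; the toy data, the kernel parameter gamma, and all other settings are illustrative assumptions.

# Sketch only: SGD-trained linear SVM on a random Fourier feature
# approximation of the RBF kernel. Not the authors' ASSET code.
import numpy as np

def random_fourier_features(X, d, gamma, rng):
    # Approximates k(x, y) = exp(-gamma * ||x - y||^2) by an explicit
    # d-dimensional map (Rahimi and Recht, 2008).
    p = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(p, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=d)
    return np.sqrt(2.0 / d) * np.cos(X @ W + b)

def sgd_svm(Z, y, lam=1e-5, epochs=1):
    # Pegasos-style stochastic subgradient descent on the hinge loss
    # with regularization parameter lam.
    n, d = Z.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)       # decreasing step size
            margin = y[i] * (Z[i] @ w)
            w *= (1.0 - eta * lam)      # shrinkage from the regularizer
            if margin < 1.0:            # hinge-loss subgradient step
                w += eta * y[i] * Z[i]
    return w

# Toy usage (MNIST-E itself is far larger):
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 784))
y = np.sign(rng.normal(size=1000))
Z = random_fourier_features(X, d=1024, gamma=0.02, rng=rng)
w = sgd_svm(Z, y)
print("training error:", np.mean(np.sign(Z @ w) != y))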
5 CONCLUSIONS
We have proposed a stochastic gradient framework for training large-scale and online SVMs using efficient approximations to nonlinear kernels, which can be extended easily to other kernel-based learning problems.
ACKNOWLEDGEMENTS
The authors acknowledge the support of NSF Grants DMS-0914524 and DMS-0906818. Part of this work has been supported by the German Research Foundation (DFG) grant for the Collaborative Research Center SFB 876: "Providing Information by Resource-Constrained Data Analysis".
REFERENCES
Bordes, A., Ertekin, S., Weston, J., and Bottou, L. (2005). Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579–1619.
Bottou, L. (2005). SGD: Stochastic gradient descent. http://leon.bottou.org/projects/sgd.
Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19:1155–1178.
Drineas, P. and Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175.
Franc, V. and Sonnenburg, S. (2008). Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327.
Joachims, T. (1999). Making large-scale support vector machine learning practical. In Advances in Kernel Methods - Support Vector Learning, pages 169–184. MIT Press.
Joachims, T. (2006). Training linear SVMs in linear time. In International Conference on Knowledge Discovery and Data Mining, pages 217–226.
Joachims, T., Finley, T., and Yu, C.-N. (2009). Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59.
Joachims, T. and Yu, C.-N. J. (2009). Sparse kernel SVMs via cutting-plane training. Machine Learning, 76(2-3):179–193.
Lee, S. and Wright, S. J. (2011). Approximate stochastic subgradient estimation training for support vector machines. http://arxiv.org/abs/1111.0432.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.
Nemirovski, A. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. John Wiley.
Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184. MIT Press.
Shalev-Shwartz, S., Singer, Y., and Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807–814.
Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. (2011). Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, Series B, 127(1):3–30.
Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936.