obtained when kernels from spatial, temporal, context, 3D-point and depth information are combined within the CMMKL-SVM approach. In this respect, the highest recognition rate (92.83%) has been obtained with the combination of trajectories, HOG, FPFH, depth and object descriptors. Given their relevance to intelligent robots, our future work will focus on improving multimodal fusion and on reducing the computational burden by exploiting different optimization techniques for MKL, allowing a quicker response of the robot when interacting with humans by either imitating or anticipating actions.
ACKNOWLEDGEMENTS
This research has been partially supported by the
Industrial Doctorate program of the Government of
Catalonia, and by the European Community through
the FP7 framework program by funding the Vinbot
project (No. 605630) conducted by Ateknea Solutions
Catalonia.
REFERENCES
Bautista-Ballester, J., Vergés-Llahí, J., and Puig, D. (2014).
Using action objects contextual information for a mul-
tichannel SVM in an action recognition approach based
on bag of visual words. In International Conference
on Computer Vision Theory and Applications, VIS-
APP.
Bilinski, P. and Corvee, E. (2013). Relative dense track-
lets for human action recognition. In 10th IEEE Inter-
national Conference on Automatic Face and Gesture
Recognition.
Bucak, S., Jin, R., and Jain, A. (2014). Multiple kernel
learning for visual object recognition: A review. Pat-
tern Analysis and Machine Intelligence, IEEE Trans-
actions on, 36(7):1354–1369.
Bucak, S., Jin, R., and Jain, A. K. (2010). Multi-label mul-
tiple kernel learning by stochastic approximation: Ap-
plication to visual object recognition. In Advances in
Neural Information Processing Systems, pages 325–
333.
Chang, C.-C. and Lin, C.-J. (2011). LIBSVM:
A library for support vector machines. ACM
Transactions on Intelligent Systems and Tech-
nology, 2:27:1–27:27. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Gehler, P. and Nowozin, S. (2009). Let the kernel figure
it out; principled learning of pre-processing for kernel
classifiers. In Computer Vision and Pattern Recogni-
tion (CVPR), 2009 IEEE Conference on, pages 2836–
2843. IEEE.
Ikizler-Cinbis, N. and Sclaroff, S. (2010). Object, scene and
actions: Combining multiple features for human ac-
tion recognition. In Proceedings of the 11th European
Conference on Computer Vision: Part I, ECCV’10,
pages 494–507, Berlin, Heidelberg. Springer-Verlag.
Kläser, A., Marszałek, M., and Schmid, C. (2008). A spatio-
temporal descriptor based on 3d-gradients. In British
Machine Vision Conference, pages 995–1004.
Koppula, H. S., Gupta, R., and Saxena, A. (2013). Learn-
ing human activities and object affordances from rgb-
d videos. The International Journal of Robotics Re-
search, 32(8):951–970.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre,
T. (2011). HMDB: a large video database for human
motion recognition. In Proceedings of the Interna-
tional Conference on Computer Vision (ICCV).
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui,
L., and Jordan, M. I. (2004). Learning the kernel ma-
trix with semidefinite programming. J. Mach. Learn.
Res., 5:27–72.
Laptev, I. (2005). On space-time interest points. Int. J.
Comput. Vision, 64(2-3):107–123.
Pieropan, A., Salvi, G., Pauwels, K., and Kjellstrom, H.
(2014). Audio-visual classification and detection of
human manipulation actions. In Intelligent Robots and
Systems (IROS 2014), 2014 IEEE/RSJ International
Conference on, pages 3045–3052.
Rakotomamonjy, A., Bach, F. R., Canu, S., and Grandvalet,
Y. (2008). SimpleMKL. Journal of Machine Learning
Research.
Rusu, R. B. (2009). Semantic 3D Object Maps for Ev-
eryday Manipulation in Human Living Environments.
PhD thesis, Computer Science department, Technis-
che Universitaet Muenchen, Germany.
Snoek, C. G. M., Worring, M., and Smeulders, A. W. M.
(2005). Early versus late fusion in semantic video
analysis. In Proceedings of the 13th Annual ACM
International Conference on Multimedia, MULTIME-
DIA ’05, pages 399–402, New York, NY, USA. ACM.
Tsai, J.-S., Hsu, Y.-P., Liu, C., and Fu, L.-C. (2013). An ef-
ficient part-based approach to action recognition from
rgb-d video with bow-pyramid representation. In In-
telligent Robots and Systems (IROS), 2013 IEEE/RSJ
International Conference on, pages 2234–2239.
Vedaldi, A., Gulshan, V., Varma, M., and Zisserman, A.
(2009). Multiple kernels for object detection. In Pro-
ceedings of the International Conference on Computer
Vision, 2009.
Wang, H., Kläser, A., Schmid, C., and Liu, C. (2013).
Dense trajectories and motion boundary descriptors
for action recognition. International Journal of Com-
puter Vision.
Wang, H., Kläser, A., Schmid, C., and Liu, C.-L. (2011).
Action Recognition by Dense Trajectories. In IEEE
Conf. on Computer Vision & Pattern Recognition,
pages 3169–3176, Colorado Springs, United States.
Wang, H. and Schmid, C. (2013). Action Recognition with
Improved Trajectories. In ICCV 2013 - IEEE Interna-
tional Conference on Computer Vision, pages 3551–
3558, Sydney, Australia. IEEE.