we considered 121 sequences from the same unknown and activity and 154 sequences from a different unknown and activity for each class, in order to balance the activity classes and the unknown class. In this way, the dataset contains 12177 sequences. This is the input to a Siamese network (Koch et al., 2015), which consists of twin networks that accept distinct inputs but share their weights. During training, the two sub-networks extract features from the two inputs, while the joining neuron measures the distance between the two feature vectors.
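The shared-weight twin structure can be sketched as follows. This is a minimal NumPy illustration: the single-layer embedding function, its weight matrix, and the input size are hypothetical placeholders, not the network actually used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared weights of one embedding sub-network:
# a single dense layer mapping a 16-sample sequence to a 4-D feature vector.
W = rng.normal(size=(16, 4))

def embed(x, weights):
    """Sub-network forward pass: both twins call this with the SAME weights."""
    return np.tanh(x @ weights)

def joining_neuron(x1, x2, weights):
    """Euclidean distance between the two twins' feature vectors."""
    f1, f2 = embed(x1, weights), embed(x2, weights)
    return np.linalg.norm(f1 - f2)

# Two distinct inputs pass through identical twins.
a, b = rng.normal(size=16), rng.normal(size=16)
d_ab = joining_neuron(a, b, W)  # distance between the two embeddings
d_aa = joining_neuron(a, a, W)  # identical inputs give distance 0
```

Because the weights are shared, identical inputs are always mapped to the same point of the embedding space, so their distance is exactly zero.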
In our experiment, the Euclidean metric is used to compute the distances between inputs, and the contrastive loss function has been used. Three convolutional layers are considered, with 32, 64 and 2 filters respectively, all of size 3 × 1 and with a ReLU activation function. The output of each convolutional layer is reduced in size by a max-pooling layer that halves the number of features. A k-nearest-neighbor classification algorithm (K-NN) and a support vector machine (SVM) are applied to the extracted features for classification purposes.
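For a pair with Euclidean distance d and label y (y = 1 for a same-class pair, y = 0 otherwise), the contrastive loss can be written as L = y·d² + (1 − y)·max(m − d, 0)². A minimal sketch, where the margin m and the toy distances are illustrative assumptions rather than values from the paper:

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    """Contrastive loss on a pairwise Euclidean distance d.

    y = 1 (same class):      penalty d**2 pulls the pair together.
    y = 0 (different class): pushes the pair apart, up to the margin.
    """
    return y * d**2 + (1 - y) * np.maximum(margin - d, 0.0)**2

# Same-class pair that is still far apart: penalized quadratically.
loss_same = contrastive_loss(d=0.8, y=1)        # ~0.64
# Different-class pair already beyond the margin: no penalty.
loss_diff_far = contrastive_loss(d=1.5, y=0)    # 0.0
# Different-class pair inside the margin: penalized.
loss_diff_near = contrastive_loss(d=0.4, y=0)   # ~(1.0 - 0.4)**2 = 0.36
```

Minimizing this loss shapes the embedding so that same-class sequences cluster while different-class sequences stay at least a margin apart, which is what makes simple distance-based classifiers such as K-NN effective on the learned features.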
5 STAGE OF THE RESEARCH
Our preliminary results show that we can predict daily activity from multimodal data. In particular, the Stanford-ECM dataset has been considered, and we implemented a Siamese network to build an embedding space. The performance of the experiment has been evaluated with an SVM for different kernels and a K-NN for different values of K.
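The K-NN part of such an evaluation can be sketched as follows. The 2-D synthetic "embeddings" and the grid of K values here are hypothetical stand-ins, not the paper's actual features or results:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic embedding space: two well-separated clusters of 2-D points.
train = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
test = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])
truth = np.array([0] * 5 + [1] * 5)

def knn_predict(x, k):
    """Majority vote among the k nearest training embeddings (Euclidean)."""
    d = np.linalg.norm(train - x, axis=1)
    nearest = labels[np.argsort(d)[:k]]
    return np.bincount(nearest).argmax()

for k in (1, 3, 5):  # evaluate for different values of K
    preds = np.array([knn_predict(x, k) for x in test])
    acc = (preds == truth).mean()
```

On well-clustered embeddings like these, accuracy is insensitive to K; on real features, sweeping K (and, for the SVM, the kernel) is what reveals how well the embedding separates the classes.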
In future work, we want to improve our pipeline and test it on other datasets. The result that I expect, and that should validate both the problem and the approach, is to exceed the classification accuracy of the baseline.
ACKNOWLEDGEMENTS
I would like to thank my advisors Prof. Sebastiano
Battiato (University of Catania), Prof. Giovanni Maria Farinella (University of Catania) and Ing. Valeria Tomaselli (STMicroelectronics, Catania) for their
continued supervision and support.
REFERENCES
Aytar, Y., Vondrick, C., and Torralba, A. (2017). See,
hear, and read: Deep aligned representations. CoRR,
abs/1706.00932.
Bano, S., Cavallaro, A., and Parra, X. (2015). Gyro-based
camera-motion detection in user-generated videos. In
Proceedings of the 23rd ACM International Confe-
rence on Multimedia, MM ’15, pages 1303–1306,
New York, NY, USA. ACM.
Boström, H., Andler, S. F., Brohede, M., Laere, J. V., Niklasson, L., Nilsson, M., Persson, A., and Ziemke, T.
(2007). On the definition of information fusion as a
field of research. Technical report.
Bütepage, J., Black, M. J., Kragic, D., and Kjellström, H. (2017). Deep representation learning for human motion prediction and classification. CoRR,
abs/1702.07486.
Chan, F.-H., Chen, Y.-T., Xiang, Y., and Sun, M. (2017).
Anticipating accidents in dashcam videos. In Lai,
S.-H., Lepetit, V., Nishino, K., and Sato, Y., edi-
tors, Computer Vision – ACCV 2016, pages 136–153,
Cham. Springer International Publishing.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., and Fei-Fei,
L. (2009). Imagenet: A large-scale hierarchical image
database. In 2009 IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 248–255.
Duarte, N., Tasevski, J., Coco, M. I., Rakovic, M., and
Santos-Victor, J. (2018). Action anticipation: Re-
ading the intentions of humans and robots. CoRR,
abs/1802.02788.
Furnari, A., Battiato, S., Grauman, K., and Farinella, G. M.
(2017). Next-active-object prediction from egocentric
videos. Journal of Visual Communication and Image
Representation, 49:401–411.
Gao, J., Yang, Z., and Nevatia, R. (2017). RED: reinfor-
ced encoder-decoder networks for action anticipation.
CoRR, abs/1707.04818.
Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Sia-
mese neural networks for one-shot image recognition.
Koppula, H. S., Jain, A., and Saxena, A. (2016). Antici-
patory Planning f or Human-Robot Teams, pages 453–
470. Springer International Publishing, Cham.
Koppula, H. S. and Saxena, A. (2016). Anticipating human
activities using object affordances for reactive robotic response. IEEE Trans. Pattern Anal. Mach. Intell.,
38(1):14–29.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
Imagenet classification with deep convolutional neu-
ral networks. In Pereira, F., Burges, C. J. C., Bottou,
L., and Weinberger, K. Q., editors, Advances in Neu-
ral Information Processing Systems 25, pages 1097–
1105. Curran Associates, Inc.
Lan, T., Chen, T.-C., and Savarese, S. (2014). A hierarchical representation for future action prediction. In
Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T.,
editors, Computer Vision – ECCV 2014, pages 689–
704, Cham. Springer International Publishing.
LeCun, Y., Bengio, Y., and Hinton, G. E. (2015). Deep learning. Nature, 521(7553):436–444.
Ma, S., Sigal, L., and Sclaroff, S. (2016). Learning activity
progression in lstms for activity detection and early
detection. In 2016 IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 1942–
1950.