videos using cnn-extracted features. The Visual Computer.
Alfaifi, R. and Artoli, A. (2020). Human action prediction with 3d-cnn. SN Computer Science.
Bacharidis, K. and Argyros, A. (2020). Improving deep learning approaches for human activity recognition based on natural language processing of action labels. In IJCNN. IEEE.
Bochkovskiy, A., Wang, C., and Liao, H. (2020). Yolov4: Optimal speed and accuracy of object detection. arXiv:2004.10934.
Cao, K., Ji, J., Cao, Z., Chang, C.-Y., and Niebles, J. C. (2020). Few-shot video classification via temporal alignment. In CVPR.
Chang, C.-Y., Huang, D.-A., Sui, Y., Fei-Fei, L., and Niebles, J. C. (2019). D3tw: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In CVPR.
Cuturi, M. and Blondel, M. (2017). Soft-dtw: a differentiable loss function for time-series. arXiv:1703.01541.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In IEEE CVPR.
Dvornik, N., Hadji, I., Derpanis, K. G., Garg, A., and Jepson, A. D. (2021). Drop-dtw: Aligning common signal between sequences while dropping outliers. arXiv preprint arXiv:2108.11996.
Ghoddoosian, R., Sayed, S., and Athitsos, V. (2021). Action duration prediction for segment-level alignment of weakly-labeled videos. In IEEE WACV.
Hadji, I., Derpanis, K. G., and Jepson, A. D. (2021). Representation learning via global temporal alignment and cycle-consistency. arXiv preprint arXiv:2105.05217.
Haresh, S., Kumar, S., Coskun, H., Syed, S. N., Konin, A., Zia, M. Z., and Tran, Q.-H. (2021). Learning by aligning videos in time. arXiv preprint arXiv:2103.17260.
Kim, D., Jang, M., Yoon, Y., and Kim, J. (2015). Classification of dance motions with depth cameras using subsequence dynamic time warping. In SPPR. IEEE.
Koppula, H., Gupta, R., and Saxena, A. (2013). Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research.
Manousaki, V., Papoutsakis, K., and Argyros, A. (2018). Evaluating method design options for action classification based on bags of visual words. In VISAPP.
Manousaki, V., Papoutsakis, K., and Argyros, A. (2021). Action prediction during human-object interaction based on dtw and early fusion of human and object representations. In ICVS. Springer.
Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., and Bajcsy, R. (2013). Berkeley mhad: A comprehensive multimodal human action database. In IEEE WACV.
Panagiotakis, C., Papoutsakis, K., and Argyros, A. (2018). A graph-based approach for detecting common actions in motion capture data and videos. Pattern Recognition.
Papoutsakis, K., Panagiotakis, C., and Argyros, A. (2017a). Temporal action co-segmentation in 3d motion capture data and videos. In CVPR.
Papoutsakis, K., Panagiotakis, C., and Argyros, A. A. (2017b). Temporal action co-segmentation in 3d motion capture data and videos. In CVPR.
Park, A. S. and Glass, J. R. (2007). Unsupervised pattern discovery in speech. IEEE Transactions on Audio, Speech, and Language Processing.
Reily, B., Han, F., Parker, L., and Zhang, H. (2018). Skeleton-based bio-inspired human activity prediction for real-time human–robot interaction. Autonomous Robots.
Rius, I., González, J., Varona, J., and Roca, F. (2009). Action-specific motion prior for efficient bayesian 3d human body tracking. Pattern Recognition.
Roditakis, K., Makris, A., and Argyros, A. (2021). Towards improved and interpretable action quality assessment with self-supervised alignment.
Tavenard, R., Faouzi, J., Vandewiele, G., Divo, F., Androz, G., Holtz, C., Payne, M., Yurchak, R., Rußwurm, M., Kolar, K., and Woods, E. (2020). Tslearn, a machine learning toolkit for time series data. Journal of Machine Learning Research.
Sakoe, H. and Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing.
Schez-Sobrino, S., Monekosso, D. N., Remagnino, P., Vallejo, D., and Glez-Morcillo, C. (2019). Automatic recognition of physical exercises performed by stroke survivors to improve remote rehabilitation. In MAPR.
Schuldt, C., Laptev, I., and Caputo, B. (2004). Recognizing human actions: a local svm approach. In ICPR. IEEE.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Tormene, P., Giorgino, T., Quaglini, S., and Stefanelli, M. (2009). Matching incomplete time series with dynamic time warping: an algorithm and an application to post-stroke rehabilitation. Artificial Intelligence in Medicine.
Wang, J., Liu, Z., Wu, Y., and Yuan, J. (2012). Mining actionlet ensemble for action recognition with depth cameras. In IEEE CVPR.
Xia, L. and Aggarwal, J. (2013). Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In IEEE CVPR.
Yang, C.-K. and Tondowidjojo, R. (2019). Kinect v2 based real-time motion comparison with re-targeting and color code feedback. In IEEE GCCE.
Segregational Soft Dynamic Time Warping and Its Application to Action Prediction