Human Action Recognition using Multi-Kernel Learning for Temporal Residual Network
Saima Nazir¹,²,⁴, Yu Qian⁴, Muhammad Haroon Yousaf¹, Sergio A. Velastin²,³,⁴, Ebroul Izquierdo² and Eduard Vazquez⁴
¹University of Engineering and Technology Taxila, Pakistan
²Queen Mary University of London, U.K.
³Universidad Carlos III de Madrid, Spain
⁴Cortexica Vision Systems Ltd., U.K.
Keywords: Deep Learning, Residual Network, Spatio-Temporal Network, Temporal Residual Network, Human Action Recognition.
Abstract: Deep learning has led to a series of breakthroughs in the field of human action recognition. Given the powerful representational ability of residual networks (ResNet), performance in many computer vision tasks, including human action recognition, has improved. Motivated by the success of ResNet, we use the residual network and its variations to obtain feature representations. Bearing in mind the importance of appearance and motion information for action representation, our network utilizes both for feature extraction. Appearance and motion features are further fused for action classification using a multi-kernel support vector machine (SVM). We also investigate the fusion of dense trajectories with the proposed network to boost its performance. We evaluate the proposed methods on a benchmark dataset (HMDB-51); the results show that multi-kernel learning performs better than fusing the classification scores from the deep networks' SoftMax layers. The proposed method also performs well compared to recent state-of-the-art methods.
1 INTRODUCTION
Human action recognition is an important yet challenging task in the field of computer vision (Nazir et al., 2018). Over the last decade, many researchers have focused on deep convolutional networks for human action recognition. Since the introduction of a large-scale image repository (ImageNet) (Deng et al., 2009) and high-performance computational hardware (GPUs), image and action recognition using deep convolutional neural networks (CNNs) has enjoyed remarkable success (Tran et al., 2018; Pham et al., 2018; Herath et al., 2017). However, achieving accurate recognition at the level of human understanding of ongoing actions in real environments is still strenuous.
After the introduction of AlexNet (Krizhevsky et al., 2012), researchers began focusing on increasing the depth of state-of-the-art CNN architectures. Networks such as the VGG family (Simonyan and Zisserman, 2014b) have grown progressively deeper, from 8 layers (VGG-M) to 16 (VGG-16) and 19 (VGG-19) layers. However, simply stacking layers to go deeper does not guarantee an increase in performance. He et al. (He et al., 2016) show that a 56-layer plain network has higher training and test error than a 20-layer plain network. Consequently, deeper networks are harder to train, and their performance may saturate or even degrade as more layers are added.
To overcome this problem, He et al. (He et al., 2016) proposed the residual network (ResNet), which introduces identity shortcut connections that pass the input of a layer to a later layer without modification. To obtain this identity mapping, they introduced a residual learning framework in which the stacked layers fit a residual mapping rather than the desired underlying mapping directly.
After the introduction of such a deep network, ResNet (He et al., 2016), many researchers have proposed variations of ResNet to improve performance in different computer vision applications such as image classification, object recognition and action recognition. ResNet (He et al., 2016) was originally introduced for the image recognition task.
Many researchers have worked on extending 2D convolutional neural network architectures to the temporal domain. As shown in Fig. 1(b), Feichtenhofer et al. (Feichtenhofer et al., 2016a) integrated space-time information by injecting residual connections between the appearance and motion pathways of a two-stream architecture.
Figure 1: (a) Original residual unit [ResNet], (b) spatio-temporal residual unit [ST-ResNet], (c) temporal residual unit [T-ResNet].
To operate on adjacent feature maps in the time domain, they applied spatio-temporal filter kernels instead of purely spatial ones.
To transform a spatial ConvNet into a spatio-temporal ConvNet, Feichtenhofer et al. (Feichtenhofer et al., 2017b) modify the basic residual block (He et al., 2016) and inject temporal information using 1D temporal filters. As shown in Fig. 1(c), they introduce an additional convolutional block into the original residual unit, along with an affine scaling layer $S_l$.
The closest work to ours is the temporal ResNet (Feichtenhofer et al., 2017b), which augments the spatial residual network with temporal information via convolutions across time. In contrast to their work, we extract features from the temporal ResNet for both the appearance and flow streams and fuse the two feature sets for classification using a multi-kernel SVM.
2 RELATED WORK
Human action recognition has remained one of the important and challenging tasks in the field of computer vision. Many researchers have contributed by proposing hand-crafted feature representations for action recognition. Spatio-temporal interest points (3D Harris) (Laptev, 2005), the 3D scale-invariant feature transform (3D SIFT) (Scovanner et al., 2007) and dense trajectories (Wang et al., 2011) are a few of the state-of-the-art hand-crafted feature representation approaches. Since the introduction of CNNs, many researchers have concentrated on deep learning approaches for action recognition. Zeiler and Fergus (Zeiler and Fergus, 2014) showed that a convolutional neural network can serve as a feature extractor, since the first layers of a CNN learn low-level features while high-level features are learned by the top layers.
Many researchers have extended the traditional 2D convolutional neural network (CNN) to integrate temporal information. A 3D CNN was proposed by Hara et al. (Hara et al., 2017) to capture temporal information by using 3D kernels that extract information from both the spatial and temporal domains. Restricting the spatial extent is reasonable, whereas limiting the temporal extent to a few frames seems insufficient for extracting motion information. Ng et al. (Yue-Hei Ng et al., 2015) noted that, instead of learning temporal information from a few frames, temporal pooling can be applied to integrate temporal information across network layers.
Simonyan and Zisserman (Simonyan and Zisserman, 2014a) extended the convolutional neural network by proposing a two-stream architecture. Similar to the way the human visual cortex learns appearance and motion information through the ventral (object recognition) and dorsal (motion recognition) streams, they used two pathways to learn spatial and temporal information for appearance and motion respectively. The spatial stream is initialised from a model pretrained on ImageNet, while the temporal stream is trained on an action recognition dataset given the unavailability of models pretrained on flow information.
Following the success of the two-stream architecture, Feichtenhofer et al. (Feichtenhofer et al., 2016a) extended the two-stream convolutional neural network and proposed a spatio-temporal residual network.
Figure 2: Proposed framework: multi-kernel SVM learning using temporal residual network features for human action recognition.
Driven by the success of ResNet in ILSVRC 2015, they used a ResNet pretrained on ImageNet for action recognition. They embed residual connections between the appearance and motion streams to allow hierarchical learning of spatio-temporal features in both streams.
Hara et al. (Hara et al., 2017) also utilized the ResNet model for the human action classification task. They use 3D convolutional kernels to integrate temporal information into the residual network, and they trained and evaluated their 3D ResNet model on the relatively large datasets (Kinetics and ActivityNet) available for human action recognition. The temporal ResNet (Feichtenhofer et al., 2017b) also integrates temporal information into the spatial residual network, modifying the basic residual function by adding an additional temporal convolution block.
3 PROPOSED METHOD
In this paper, we evaluate the performance of the residual network (ResNet) and its variations for human action recognition. We adopt a two-stream framework in which the networks are trained on appearance (RGB frames) and motion (optical flow) respectively. We utilize the residual network (ResNet) (He et al., 2016), the spatio-temporal residual network (ST-ResNet) (Feichtenhofer et al., 2016a) and the temporal residual network (T-ResNet) (Feichtenhofer et al., 2017b) for feature extraction, extracting features for both the appearance and motion representations. These features are then passed to a multi-kernel support vector machine (SVM) for action classification. Fig. 2 shows the general framework for action classification using a multi-kernel SVM, where features are extracted from the RGB and flow streams supplied to the temporal residual networks. Using the approach shown in Fig. 2, we also extract features from the other two networks, i.e. the residual network and the spatio-temporal residual network. Details of feature extraction from these three residual networks are provided in Sections 3.1, 3.2 and 3.3.
3.1 Feature Extraction using Residual
Layers (ResNet)
As the base representation, we utilize the ResNet50 model provided by (Feichtenhofer et al., 2016a) for feature extraction. They used a ResNet50 (He et al., 2016) model pretrained on ImageNet and replaced the last classification layer to match the number of classes in the HMDB-51 dataset. We truncate the network and obtain a 2048-dimensional feature vector from the pooling layer after convolution block 5, separately for the appearance and motion streams. For further information on network training and model construction, please refer to (Feichtenhofer et al., 2016a).
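The truncation step can be illustrated with a short sketch. The models used in the paper are those released by Feichtenhofer et al.; the PyTorch code below is only an assumed equivalent that keeps a ResNet-50 up to the global pooling layer after convolution block 5 and drops the classification layer.

```python
import torch
import torchvision.models as models

# Minimal sketch (assumed PyTorch stand-in for the model used in the paper):
# keep ResNet-50 up to the global average pooling after conv block 5 and drop
# the final fully connected layer, giving one 2048-d vector per input frame.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
feature_extractor.eval()

with torch.no_grad():
    frames = torch.randn(16, 3, 224, 224)         # placeholder batch of RGB frames
    feats = feature_extractor(frames).flatten(1)  # shape (16, 2048)
```

The motion stream is handled in the same way using the corresponding flow network.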
3.2 Feature Extraction using
Spatiotemporal Residual Layers
(ST-ResNet)
We also make use of the spatio-temporal residual network (ST-ResNet) (Feichtenhofer et al., 2016a) to extract appearance and motion features. This network architecture integrates motion information into the spatial stream by adding residual connections between the two streams. To learn space-time features, it also includes temporal convolutions in the residual units. We truncate the network at the pooling layer after the last convolutional block to obtain 2048-dimensional features for both streams. For further information on network training and model construction, please refer to (Feichtenhofer et al., 2016a).
Table 1: Accuracy (%) for the HMDB-51 dataset using the residual network (ResNet).

Split   | RGB (SoftMax) | RGB (MKL SVM) | Flow (SoftMax) | Flow (MKL SVM) | RGB+Flow (SoftMax) | RGB+Flow (MKL SVM) | RGB+Flow+iDT (SoftMax) | RGB+Flow+iDT (MKL SVM)
1       | 49.0 | 50.3 | 57.7 | 63.5 | 63.3 | 67.5 | 67.5 | 70.1
2       | 48.7 | 48.3 | 54.6 | 59.3 | 61.8 | 65.0 | 67.7 | 67.1
3       | 49.6 | 48.0 | 56.5 | 61.8 | 61.8 | 65.8 | 65.6 | 67.9
Average | 49.1 | 48.9 | 56.3 | 61.5 | 62.3 | 66.1 | 66.9 | 68.4
Table 2: Accuracy (%) for the HMDB-51 dataset using the spatio-temporal residual network (ST-ResNet).

Split   | RGB (SoftMax) | RGB (MKL SVM) | Flow (SoftMax) | Flow (MKL SVM) | RGB+Flow (SoftMax) | RGB+Flow (MKL SVM) | RGB+Flow+iDT (SoftMax) | RGB+Flow+iDT (MKL SVM)
1       | 56.2 | 56.6 | 58.8 | 61.7 | 68.0 | 67.6 | 69.1 | 68.3
2       | 56.8 | 57.1 | 59.8 | 62.1 | 66.0 | 65.8 | 66.2 | 66.3
3       | 55.5 | 56.4 | 59.8 | 65.2 | 66.3 | 67.0 | 66.7 | 67.1
Average | 56.2 | 56.7 | 59.5 | 63.0 | 66.8 | 66.8 | 67.3 | 67.2
3.3 Feature Extraction using Temporal
Residual Layers (T-ResNet)
We also use the temporal residual network architecture (T-ResNet) for the extraction of appearance and motion features. For appearance, we utilize the architecture proposed in (Feichtenhofer et al., 2017b) and build the spatio-temporal residual network on a ResNet50 (He et al., 2016) pretrained on ImageNet; temporal information is then injected into this spatio-temporal residual unit. For flow, we utilize the model provided in (Feichtenhofer et al., 2016a) and inject the temporal connections into the spatio-temporal residual network pretrained on the HMDB-51 dataset. In contrast to (Feichtenhofer et al., 2017b), who fuse the predictions of both networks, we truncate both networks at the pooling layer after convolution block 5 to obtain the appearance and flow feature representations separately. For further information on network training and model construction, please refer to (Feichtenhofer et al., 2017b).
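As a rough illustration of how the truncated streams feed the classifier, the sketch below average-pools the frame-level 2048-d activations into one descriptor per stream. The temporal pooling choice is our assumption: the paper only states that the appearance and flow representations are kept separate for the multi-kernel SVM.

```python
import numpy as np

# Hypothetical aggregation of frame-level features into per-video descriptors.
# frame_feats: array of shape (num_frames, 2048) from a truncated (T-)ResNet stream.
def video_descriptor(frame_feats: np.ndarray) -> np.ndarray:
    return frame_feats.mean(axis=0)              # (2048,) video-level descriptor

rgb_desc = video_descriptor(np.random.rand(25, 2048))   # appearance stream
flow_desc = video_descriptor(np.random.rand(25, 2048))  # motion stream
# rgb_desc and flow_desc are kept separate, one kernel per modality (Section 3.4).
```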
3.4 Support Vector Machine with
Multi-kernel Learning
Given training tuples $(x_i, y_i)$ and weights $w$, under a hinge loss, an SVM solves the primal problem

$$
\begin{aligned}
\min_{w,\,b,\,\vartheta} \quad & \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} \vartheta_i \\
\text{s.t.} \quad & y_i \left( w^{T} \phi(x_i) + b \right) \geq 1 - \vartheta_i, \\
& \vartheta_i \geq 0, \qquad i = 1, \ldots, n
\end{aligned}
\tag{1}
$$
As is customary in kernel methods, computations involving $\phi$ are handled using kernel functions $k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. In all our experiments, a Radial Basis Function (RBF) kernel has been used. $C$ (fixed at 1) is the penalty parameter and $\vartheta$ is the slack variable. For multiple kernel learning, we follow the recipe of (Sonnenburg et al., 2006) and formulate a convex combination of sub-kernels as

$$
k(x_i, x_j) = \sum_{k=1}^{K} \beta_k \, K_k(x_i, x_j)
\tag{2}
$$
In contrast to (Sonnenburg et al., 2006), we use L2 regularisation with $\beta_k \geq 0$ and $\sum_{k=1}^{K} \beta_k^2 \leq 1$. The L2-regularised multiple kernel is learnt by formulating Eq. (2) as a semi-infinite linear programming (SILP) problem. During each iteration, an SVM solver is first instantiated to obtain the weighted support vectors; subsequently, a linear programming (LP) problem is solved using Mosek (mos, 2012).
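The fused kernel of Eq. (2) can be sketched as follows. In the paper the weights β_k are learned via SILP with an LP solver (Mosek); in this illustrative snippet they are simply fixed, and scikit-learn's precomputed-kernel SVM stands in for the actual solver.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Sketch of Eq. (2): a convex combination of RBF sub-kernels, one per feature
# type (e.g. RGB, flow, iDT). The weights below are assumed values for
# illustration only; in the paper they are learned by SILP.
def combined_kernel(feats_a, feats_b, betas, gamma=1e-3):
    return sum(b * rbf_kernel(Xa, Xb, gamma=gamma)
               for b, Xa, Xb in zip(betas, feats_a, feats_b))

rgb, flow = np.random.rand(100, 2048), np.random.rand(100, 2048)  # toy descriptors
labels = np.random.randint(0, 51, size=100)                        # 51 action classes
betas = [0.6, 0.4]                                                  # assumed weights, sum(beta^2) <= 1

K_train = combined_kernel([rgb, flow], [rgb, flow], betas)
clf = SVC(C=1.0, kernel="precomputed").fit(K_train, labels)         # C fixed at 1 as in the paper
```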
4 EVALUATION
4.1 Dataset
We used the HMDB-51 (Kuehne et al., 2011) human action recognition dataset to evaluate the proposed method. HMDB-51 is a large-scale benchmark dataset containing about 7k videos. It is divided into training, testing and validation sets containing 3570, 1530 and 1749 videos respectively. The videos are taken from digital movies and public video databases such as YouTube. The dataset is captured in realistic environments and is comparatively more challenging than other benchmark datasets such as UCF-101. For evaluation, we used the standard evaluation protocol and report performance as the average accuracy over the three splits. All experiments were run on an Intel Xeon E5-2687W 3 GHz workstation with 128 GB of RAM and two 12 GB NVIDIA TITAN Xp GPUs.
Table 3: Accuracy (%) for the HMDB-51 dataset using the temporal residual network (T-ResNet).

Split   | RGB (SoftMax) | RGB (MKL SVM) | Flow (SoftMax) | Flow (MKL SVM) | RGB+Flow (SoftMax) | RGB+Flow (MKL SVM) | RGB+Flow+iDT (SoftMax) | RGB+Flow+iDT (MKL SVM)
1       | 51.3 | 53.9 | 64.2 | 64.9 | 67.2 | 70.1 | 67.9 | 71.1
2       | 51.5 | 53.2 | 64.6 | 65.0 | 68.6 | 69.1 | 69.4 | 69.8
3       | 49.9 | 51.9 | 65.8 | 65.2 | 68.6 | 68.9 | 68.8 | 69.9
Average | 50.9 | 53.0 | 64.9 | 65.0 | 68.1 | 69.4 | 68.7 | 70.3
Table 4: Comparison with state-of-the-art methods on the HMDB-51 dataset.

Method | Accuracy (%) | Reference
Two-stream | 59.4 | (Simonyan and Zisserman, 2014a)
Rank Pooling (ALL) + HRP (CNN) | 65.0 | (Fernando and Gould, 2017)
Convolutional two-stream (without/with iDT) | 65.4 / 69.2 | (Feichtenhofer et al., 2016b)
Temporal-Inception | 67.5 | (Ma et al., 2017)
Temporal Segment Network (2/3 modalities) | 68.5 / 69.4 | (Wang et al., 2016)
TS-LSTM | 69.0 | (Ma et al., 2017)
ST-ResNet (without/with iDT) | 66.4 / 70.3 | (Ma et al., 2017)
ST-multiplier network (without/with iDT) | 68.9 / 72.2 | (Feichtenhofer et al., 2017a)
Two-stream I3D on split 1 | 66.4 | (Carreira and Zisserman, 2017)
Two-stream I3D with Kinetics pre-training | 80.7 | (Carreira and Zisserman, 2017)
Temporal ResNet | 67.2 | (Feichtenhofer et al., 2017b)
MKL SVM ResNet (RGB+Flow+iDT) | 68.4 | Ours
MKL SVM ST-ResNet (RGB+Flow+iDT) | 67.2 | Ours
MKL SVM T-ResNet (RGB+Flow+iDT) | 70.3 | Ours
4.2 Results and Discussion
In this section we present results for the three residual networks, comparing action classification performance using SoftMax and the multi-kernel support vector machine. To fuse the scores of multiple networks (the outputs of the SoftMax layers, taken from the respective models), we used a 1:1 weighting scheme. For the MKL SVM, we tuned the parameters on the validation set and chose the L2-norm, as it always performed better than the L1-norm. We fixed the penalty parameter C = 1, as it performed better than C = 10 or C = 100. Along with the appearance and motion features, we also considered iDT features, computed following the method described in (Wang et al., 2013).
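For reference, the SoftMax baseline amounts to the simple averaging sketched below; the (num_videos, 51) score shape is an assumption for illustration.

```python
import numpy as np

# Sketch of the 1:1 late fusion used for the SoftMax baseline: class scores from
# the appearance and motion networks are averaged with equal weights before the
# final argmax over the 51 classes.
def fuse_softmax_scores(rgb_scores, flow_scores, w_rgb=0.5, w_flow=0.5):
    fused = w_rgb * rgb_scores + w_flow * flow_scores
    return fused.argmax(axis=1)                   # predicted class index per video
```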
As the base representation for our proposed methods, we used the ResNet50 (Feichtenhofer et al., 2016a) trained on the HMDB-51 dataset. Table 1 reports the performance of ResNet50 over the three splits. The ResNet trained on flow information performs better than the ResNet trained on RGB information, and combining RGB and flow information gives better results than either individual network. Comparing the multi-kernel SVM with SoftMax using RGB+Flow and RGB+Flow+iDT information, the results show that the MKL SVM gives roughly a 4% and 2% increase in performance over SoftMax, respectively.
Similarly, Table 2 reports the performance of the spatio-temporal residual network on the HMDB-51 dataset. We used the ST-ResNet50 (Feichtenhofer et al., 2016a) model trained on HMDB-51 for both appearance and flow feature extraction. Multiple kernel learning provides better performance than SoftMax, especially when the network is trained on flow information. However, when learning three kernels for the RGB, flow and iDT features, MKL could not outperform SoftMax.
Table 3 shows the performance of the temporal residual network for the RGB, flow and iDT features. We trained the T-ResNet (Feichtenhofer et al., 2017b) using RGB and flow information. The MKL SVM always performs better than SoftMax when multi-kernel learning is applied to features obtained from the temporal residual network.
Comparing the three residual networks, i.e. ResNet, ST-ResNet and T-ResNet, we conclude that the temporal residual network (T-ResNet), which integrates temporal information into the basic residual unit of ResNet, is beneficial for action recognition, where the importance of temporal information alongside spatial information is non-negligible.
We conclude the evaluation with a comparison of our proposed methods against state-of-the-art deep learning methods. As shown in Table 4, the multi-kernel learning SVM approach boosts the results and provides state-of-the-art performance on the HMDB-51 dataset.
Feichtenhofer et al. (Feichtenhofer et al., 2017a) used ResNet-50 for appearance and ResNet-152 for motion in their architecture for the HMDB-51 dataset. They concluded that, on HMDB-51, a deeper appearance network degrades performance, while a deeper motion network yields a sizable gain. The performance of our proposed network could therefore also be improved by using ResNet-152 for the motion stream. The highest accuracy in Table 4 was released by DeepMind in July 2017 and uses pre-training on the 240K-video Kinetics dataset followed by end-to-end fine-tuning on HMDB-51 (Carreira and Zisserman, 2017); without the Kinetics pre-trained model, their two-stream I3D achieves 66.4% accuracy on split 1 of HMDB-51. The better performance is explained by the quality of the Kinetics dataset used for pre-training the network.
5 CONCLUSION
In this paper we proposed a new method to improve performance for human action recognition. The proposed method is based on an existing state-of-the-art deep learning network, T-ResNet, which has recently shown notable performance on the human action recognition task. We utilized prevalent variations of the residual network to extract motion and appearance features on a benchmark dataset, extracting features from the popular ResNet, the spatio-temporal ResNet and the temporal ResNet. We evaluated the multi-kernel learning approach against SoftMax, and the experiments show that the MKL SVM outperforms SoftMax in almost all cases. In future, the state-of-the-art residual networks could be trained on larger datasets such as Kinetics-600 to further improve action recognition performance.
ACKNOWLEDGEMENTS
Sergio A. Velastin has received funding from the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 600371, el Ministerio de Economía, Industria y Competitividad (COFUND2013-51509), el Ministerio de Educación, Cultura y Deporte (CEI-15-17) and Banco Santander. The authors also acknowledge support from the Higher Education Commission, Pakistan.
REFERENCES
(2012). Mosek toolbox. http://docs.mosek.com/6.0/toolbox/. Last accessed on: 2018-06-30.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 4724–4733. IEEE.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE.
Feichtenhofer, C., Pinz, A., and Wildes, R. (2016a). Spatiotemporal residual networks for video action recognition. In Advances in Neural Information Processing Systems, pages 3468–3476.
Feichtenhofer, C., Pinz, A., and Wildes, R. P. (2017a). Spatiotemporal multiplier networks for video action recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7445–7454. IEEE.
Feichtenhofer, C., Pinz, A., and Wildes, R. P. (2017b). Temporal residual networks for dynamic scene recognition. In CVPR, volume 1, page 2.
Feichtenhofer, C., Pinz, A., and Zisserman, A. (2016b). Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1933–1941.
Fernando, B. and Gould, S. (2017). Discriminatively learned hierarchical rank pooling networks. International Journal of Computer Vision, 124(3):335–355.
Hara, K., Kataoka, H., and Satoh, Y. (2017). Learning spatio-temporal features with 3D residual networks for action recognition. In Proceedings of the ICCV Workshop on Action, Gesture, and Emotion Recognition, volume 2, page 4.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
Herath, S., Harandi, M., and Porikli, F. (2017). Going deeper into action recognition: A survey. Image and Vision Computing, 60:4–21.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011). HMDB: a large video database for human motion recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2556–2563. IEEE.
Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123.
Ma, C.-Y., Chen, M.-H., Kira, Z., and AlRegib, G. (2017). TS-LSTM and temporal-inception: Exploiting spatio-temporal dynamics for activity recognition. arXiv preprint arXiv:1703.10667.
Nazir, S., Yousaf, M. H., Nebel, J.-C., and Velastin, S. A. (2018). A bag of expression framework for improved human action recognition. Pattern Recognition Letters, 103:39–45.
Pham, H.-H., Khoudour, L., Crouzil, A., Zegers, P., and Velastin, S. A. (2018). Exploiting deep residual networks for human action recognition from skeletal data. Computer Vision and Image Understanding.
Scovanner, P., Ali, S., and Shah, M. (2007). A 3-dimensional SIFT descriptor and its application to action recognition. In Proceedings of the 15th ACM International Conference on Multimedia, pages 357–360. ACM.
Simonyan, K. and Zisserman, A. (2014a). Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576.
Simonyan, K. and Zisserman, A. (2014b). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sonnenburg, S., Rätsch, G., Schäfer, C., and Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7(Jul):1531–1565.
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459.
Wang, H., Kläser, A., Schmid, C., and Liu, C.-L. (2011). Action recognition by dense trajectories. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3169–3176. IEEE.
Wang, H., Kläser, A., Schmid, C., and Liu, C.-L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1):60–79.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer.
Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., and Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702.
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer.