available to train these models for our specific
experiments.
To mitigate the problem of training data scarcity, we propose a method that does not require a large amount of training data to apply deep learning models to mouse behaviour recognition tasks. In our method, we utilize deep learning models pre-trained on human action recognition tasks. First, we retrain these models on the largest currently publicly available mouse behaviour dataset (Jhuang et al., 2010). This step gives the models knowledge of mouse behaviour recognition. Then, we fine-tune these models on data from our specific tasks. Because the models retrained in the first step have already learned about mouse behaviours, we do not need a large amount of data to train them for our specific tasks in the second step.
In the next section, we describe the deep learning
models and the mouse behaviour dataset we used in
the first step of our proposed method. In Section 3,
we present the swimming mouse behaviour
recognition tasks we used to evaluate our method
and the results of our experiments. Finally, in
Section 4, we state our conclusions.
2 METHOD
As described in the previous section, our proposed method has two steps. In the first step, we fine-tune deep learning models that were developed for human action recognition tasks, using the largest publicly available mouse behaviour dataset. In the second step, we train these models again on the data we prepared for our swimming mouse behaviour recognition tasks. In this section, we describe the deep learning models and the mouse behaviour dataset used in the first step of our method.
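The two steps amount to a standard transfer-learning pipeline, summarised in the minimal sketch below. The sketch assumes a PyTorch-style video-classification model whose final classifier is exposed as `model.logits`; the loading function and data loader names are hypothetical placeholders, not code from our implementation.

```python
# A minimal sketch of the two-step procedure described above, under the
# assumptions stated in the text (model.logits, placeholder loaders).
import torch
import torch.nn as nn


def replace_head(model: nn.Module, num_classes: int) -> nn.Module:
    """Swap the final classification layer for one with `num_classes` outputs."""
    model.logits = nn.Linear(model.logits.in_features, num_classes)
    return model


def train(model: nn.Module, loader, epochs: int, lr: float) -> nn.Module:
    """Generic supervised training loop with cross-entropy loss."""
    optimiser = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:
            optimiser.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimiser.step()
    return model


# Step 1: retrain a human-action pre-trained model on the 8 behaviour classes
# of the Jhuang et al. (2010) dataset; Step 2: fine-tune on the smaller
# swimming-mouse data. The loaders below are hypothetical.
# model = load_pretrained_model()
# model = train(replace_head(model, 8), jhuang_loader, epochs=20, lr=1e-2)
# model = train(replace_head(model, num_swim_classes), swim_loader, epochs=10, lr=1e-3)
```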
2.1 The Two-Stream I3D Model
Carreira and Zisserman introduced the Two-Stream Inflated 3D ConvNets (Two-Stream I3D model) (Carreira and Zisserman, 2018), one of the current state-of-the-art deep learning models for human action recognition tasks. As reported, the Two-Stream I3D models achieve 98% accuracy on the UCF-101 human action recognition dataset (Soomro, Zamir and Shah, 2012) and 80.9% accuracy on the HMDB-51 human action recognition dataset (Kuehne et al., 2011). These models are derived from the Inception-V1 model, which uses the Inception module architecture (Szegedy et al., 2015). Each Inception module combines convolution filters of different sizes with pooling kernels in parallel branches, so that features are extracted at multiple scales.
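As an illustration of this multi-branch design, the following condensed sketch shows an Inception-style block in PyTorch; the channel counts are illustrative and do not reproduce the exact Inception-V1 configuration.

```python
# A condensed Inception-style block: parallel branches with different filter
# sizes plus a pooling branch, concatenated along the channel dimension.
# Channel counts are illustrative assumptions.
import torch
import torch.nn as nn


class InceptionBlock(nn.Module):
    def __init__(self, in_ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)                 # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, 1),
                                nn.Conv2d(96, 128, 3, padding=1))     # 1x1 -> 3x3 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 32, 5, padding=2))      # 1x1 -> 5x5 branch
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))              # pooling branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate all branch outputs along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```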
To create an I3D model, all 2D filters and
pooling kernels of an Inception-V1 model are
inflated to 3D by endowing them with an additional
temporal dimension, i.e. n × n filters become n × n ×
n filters, and the weights of the 3D filters are
bootstrapped by repeating the weights of the
respective 2D filters n times along the new temporal
dimension. This bootstrapping method lets the I3D models benefit from the learned parameters of the pre-trained 2D models.
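The following minimal sketch illustrates this inflation on a single convolution weight tensor. The division by n, which keeps the response of the inflated filter on a temporally constant video equal to the 2D response, follows Carreira and Zisserman (2018); the function name is ours.

```python
# Sketch of the weight "inflation" bootstrap: a 2D n x n filter bank is
# repeated n times along a new temporal axis and rescaled by 1/n.
import torch


def inflate_conv_weight(w2d: torch.Tensor) -> torch.Tensor:
    """w2d: (out_ch, in_ch, n, n) 2D weights -> (out_ch, in_ch, n, n, n) 3D weights."""
    n = w2d.shape[-1]
    # Insert a temporal dimension, repeat n times, and rescale by 1/n.
    return w2d.unsqueeze(2).repeat(1, 1, n, 1, 1) / n


# Example: inflating a 7x7 filter bank into a 7x7x7 one.
w2d = torch.randn(64, 3, 7, 7)
w3d = inflate_conv_weight(w2d)
assert w3d.shape == (64, 3, 7, 7, 7)
```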
In this research, we used the same I3D model architectures as described in the paper of Carreira and Zisserman (Carreira and Zisserman, 2018). For the first step of our method, we used models pre-trained on ImageNet data (Russakovsky et al., 2015). As also reported by Carreira and Zisserman, training a complementary model on optical flow data computed from the RGB data can improve the prediction accuracy of the model trained on RGB data alone. Therefore, we also utilized optical flow data in this research, and we experimented with various fusion ratios between the RGB-trained and optical-flow-trained models to find the best fusion ratio for the swimming mouse behaviour recognition tasks. To compute the optical flow data from the RGB data, we used the TV-L1 algorithm (Zach, Pock and Bischof, 2007).
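The sketch below shows how such a pipeline can be set up with OpenCV's TV-L1 implementation (available in the opencv-contrib package) and how the class scores of the two streams can be combined with a fusion weight. The clipping range and the weighted-sum fusion formula are our illustrative assumptions, not code from Carreira and Zisserman.

```python
# TV-L1 optical flow between consecutive frames, plus a simple weighted
# late fusion of the two streams' class scores.
import cv2
import numpy as np

# TV-L1 optical flow (Zach, Pock and Bischof, 2007); requires opencv-contrib.
tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()


def tvl1_flow(prev_bgr: np.ndarray, next_bgr: np.ndarray) -> np.ndarray:
    """Return the (H, W, 2) flow field between two BGR frames."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = tvl1.calc(prev_gray, next_gray, None)
    # Clip and rescale to a fixed range before feeding the flow to the network
    # (the [-20, 20] range is an assumption, a commonly used preprocessing choice).
    return np.clip(flow, -20, 20) / 20.0


def fuse_scores(rgb_logits: np.ndarray, flow_logits: np.ndarray, w: float) -> np.ndarray:
    """Weighted late fusion of the two streams; w is the RGB weight in [0, 1]."""
    return w * rgb_logits + (1.0 - w) * flow_logits
```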
2.2 The Mouse Behaviour Dataset
Jhuang et al. (Jhuang et al., 2010) created a dataset to train their mouse behaviour recognition system. They recorded and annotated more than 9,000 video clips (~10 hours of video) of singly housed mice in a home cage. Eight types of behaviour are annotated in this dataset: “drink”, “eat”, “groom”, “hang”, “micro-movement”, “rear”, “rest” and “walk”.
From the recorded video clips, they selected 4,200 clips (~2.5 hours of video) containing the most unambiguous examples of each behaviour to form a subset called the “clipped database”. In this research, we used this subset for the first step of our method.
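For training, the eight behaviour labels need to be mapped to class indices and the clips enumerated. The short sketch below shows one way to do this; the on-disk layout (one directory per behaviour containing the video clips) is an assumption for illustration, not the dataset's documented structure.

```python
# Mapping the eight annotated behaviours to class indices and listing clips.
# The directory layout <root>/<behaviour>/*.avi is an assumed example.
from pathlib import Path

BEHAVIOURS = ["drink", "eat", "groom", "hang",
              "micro-movement", "rear", "rest", "walk"]
LABEL_TO_INDEX = {name: idx for idx, name in enumerate(BEHAVIOURS)}


def list_clips(root: str):
    """Yield (clip_path, class_index) pairs for every clip under `root`."""
    for name in BEHAVIOURS:
        for clip in sorted(Path(root, name).glob("*.avi")):
            yield clip, LABEL_TO_INDEX[name]
```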
3 EXPERIMENTS & RESULTS
The mouse forced swim tests are rodent behavioural tests used to study antidepressant drugs, the antidepressant efficacy of new compounds, and