Dual CNN-based Face Tracking Algorithm for an Automated Infant

Monitoring System

Cheng Li

, Genyu Song, A. Pourtaherian

and P. H. N. de With

Eindhoven University of Technology, Eindhoven, The Netherlands

Keywords:

Face Tracking, Online Updating, Infant Monitoring, Real-time Application.

Abstract:

Face tracking is important for designing a surveillance system when facial features are used as main descrip-

tors. In this paper, we propose an on-line updating face tracking method, which is not only suitable for speciﬁc

tasks, such as infant monitoring, but also a generic human-machine interaction application where face recog-

nition is required. The tracking method is based on combining the architecture of the GOTURN and YOLO

tiny face detector, which enables the tracking model to be updated over time. Tracking of objects is realized by

analyzing two neighboring frames through a deep neural network. On-line updating is achieved by comparing

the tracking result and face detection obtained from the YOLO tiny face detector. The experimental results

have shown that our proposed tracker achieves an AUC of 97.9% for precision plot and an AUC of 91.8% for

success plot, which outperforms other state-of-the-art tracking methods when used in the infant monitoring

application.

1 INTRODUCTION

Monitoring the comfort states of young infants for

certain disease diagnosis at a hospital is very im-

portant, because of their limited verbal ability to ex-

press their feelings. For better understanding of infant

behavior and moments of pain/discomfort, an auto-

mated video-based infant monitoring system can be

implemented by analyzing expressions as an auxil-

iary assessment tool. State-of-the-art infant monitor-

ing systems normally consist of three components,

face detection, landmark localization and discomfort

detection (Zamzmi et al., 2016) (Sun et al., 2018),

which is depicted in Fig.1. Since the system is com-

posed of cascaded stages, the ﬁnal discomfort detec-

tion is strongly dependent on the accuracy of preced-

ing stages, e.g. face detection. However, face detec-

tion over the entire image is computationally expen-

sive, which hampers the possibility of a real-time ap-

plication. Therefore, a face tracking method is com-

monly used for video analysis when a face is detected.

When using face tracking, several challenges

for infant monitoring systems should be considered.

First, most state-of-the-art trackers are designed for

https://orcid.org/0000-0003-2900-637X

https://orcid.org/0000-0003-4542-1354

https://orcid.org/0000-0002-7639-7716

4. Discomfort

Detection

3. Landmark

Localization

2. Face

Tracking

1. Face

Figure 1: Block diagram of an infant discomfort detec-

tion/monitoring system with a face tracking step.

pedestrians, humans or vehicles, of which textures

typically undergo limited deformations and distor-

tions over time. However, infant expressions and

their head poses can change dramatically in adjacent

frames due to unexpected stimuli and fast head move-

ments, causing signiﬁcant texture changes compared

to the tracking target. Hence, a generic object tracker

will fail to track infant faces when such situations

occur. Second, infants are normally monitored un-

der a complex environment, such as interacting with

their nearby patients. Therefore, multiple object in-

stances (faces of infants and parents) can coexist in

the search regions, which increases the ambiguity for

the designed object tracker. Last, some of state-of-

the-art trackers based on CNNs adopt complex ar-

chitectures, which require a considerable amount of

computations, thereby signiﬁcantly increasing the im-

plementation burden of a real-time infant monitoring

application.

Li, C., Song, G., Pourtaherian, A. and N. de With, P.

Dual CNN-based Face Tracking Algorithm for an Automated Infant Monitoring System.

DOI: 10.5220/0010384308810887

In Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2021) - Volume 5: VISAPP, pages

881-887

ISBN: 978-989-758-488-6

881

In order to solve these challenges, a face online-

updated tracking algorithm aiming at integrating it

into an automated infant monitoring system is pro-

posed. This algorithm is designed based on com-

bining a GOTURN tracker (Held et al., 2016) and a

YOLO pre-trained infant face detector (Redmon et al.,

2016). Compared to state-of-the-art tracking methods

using reinforcement learning (Yun et al., 2017), the

architecture of GOTURN is adopted for its simplicity

and low computation requirements. The research in

this paper presents the following contributions.

1. The proposed tracker applies an online-updating

technique which combines GOTURN and the

YOLO tiny face detector. This novel combina-

tion has proven to outperform the individual sin-

gle components and other state-of-the-art trackers

when being used for the infant monitoring appli-

cation.

2. In order to thoroughly validate the performance of

the proposed tracker, a clinical dataset captured

from a local hospital and a consumer-oriented

dataset collected from Youtube are used for evalu-

ation purposes.

3. The experimental results of the proposed system

can achieve an execution speed comparable with

state-of-the-art tracking methods based on corre-

lation ﬁltering.

This paper is organized as follows. Section 2 intro-

duces related work on several state-of-the-art track-

ing methods. Section 3 describes the proposed infant

face tracking algorithm. The tracking accuracy and

computation costs are evaluated in Section 4. Finally,

conclusions are presented in Section 5.

2 RELATED WORK

For decades, researchers have paid signiﬁcant atten-

tion to object and face tracking. Tracking algo-

rithms can be categorized into conventional meth-

ods based on template matching (Comaniciu and

Meer, 2002), Bayesian inference (Van Der Merwe

et al., 2001) (Welch et al., 1995), correlation ﬁlter-

based tracking (Bolme et al., 2010) (Danelljan et al.,

2016a) (Danelljan et al., 2016b) (Henriques et al.,

2014), and CNN-based tracking (Nam and Han,

2016) (Li et al., 2018a) (Yun et al., 2017). Conven-

tional methods based on template matching usually

heuristically search objects according to pre-deﬁned

templates. However, it has been demonstrated that

such techniques only perform well for a tracking tar-

get that can be represented by a simple feature model

(such as a ball with a uniﬁed color). Besides, these

trackers are likely to fail when obstacles occur near

the tracking target, which hampers their application

for complex tracking tasks.

Tracking methods based on correlation ﬁltering

have become prevalent for addressing the drawbacks

of template-matching methods and apply a Fourier

Transform (FT) on both the target template (object

of interest) and search regions. In (Danelljan et al.,

2016b), the target template and search regions are rep-

resented by layers of features such as in one or more

CNNs. After this, a conﬁdence map is calculated

from the FT of the template and the search regions

with a convolutional operation. Finally, the track-

ing target location in the search area is determined

with the highest conﬁdence score based on a conﬁ-

dence map. When tracking of the current frame is

succeeded, the target template is updated by the cur-

rent detection. However, this type of tracker lacks the

robustness of tracking the deformable objects, thereby

being less suited for infant face tracking.

Tracking methods based on CNN features have

shown a great success in recent years (Held et al.,

2016) (Nam and Han, 2016). The outstanding per-

formance of CNN-based tracking is explained by the

strong representation ability of CNN features. These

methods can be divided into two categories: (a) track-

ing in an off-line mode (Held et al., 2016) (Li et al.,

2018a) and (b) tracking with an on-line updating

scheme (Nam and Han, 2016) (Yun et al., 2017).

The framework of these CNN-based trackers accept

a target template and search regions as inputs of their

CNN architectures. Compared to conventional track-

ing methods and correlation ﬁlter-based trackers, ob-

ject tracking obtained by CNNs is realized by regres-

sion of CNN features instead of optimization. How-

ever, the aforementioned online-updated CNN track-

ers are designed to be generic for various objects,

which lack the speciﬁcity for tracking infant faces,

thereby making it difﬁcult to use directly in an in-

fant monitoring system. This paper proposes a CNN-

based online-updating tracking method speciﬁcally

targeting at infant faces, which combines GOTURN

and the YOLO tiny face detector. The CNN features

used for tracking can also be shared with infant ex-

pression analysis, and are therefore compatible with

a general infant monitoring system and other generic

human-machine interaction applications.

3 SYSTEM DESIGN

This section discusses the architecture of the proposed

online-updating tracking method and the training pro-

cedure in more detail.

VISAPP 2021 - 16th International Conference on Computer Vision Theory and Applications

882

Search region

Crop

Resize

Target region

Conv layers

Face detector

YOLOv3 tiny

FC layers

Bbox

update

Bbox

Conv layers

Weights

Crop

Resize

Current frame

Previous frame

current frame

Tracking result of

Figure 2: Architecture of the proposed online-updated

tracker based on the GOTURN and YOLO tiny face detec-

tor.

3.1 Architecture

When differentiating from tracking for a generic pur-

pose, a target-speciﬁc tracker can be more robust

and become less a complex CNN architecture, pro-

vided that it is trained with proper datasets. For

this reason, we utilize the same structure as in GO-

TURN (Held et al., 2016), and combine it with

AlexNet (Krizhevsky et al., 2012) because if offers

a shallow and fast network. The framework of the

target-speciﬁc on-line updated tracker is depicted in

Fig. 2. As shown, the input of the network con-

sists of a target patch and a cropped patch (search re-

gion), which are potentially containing the target to

be tracked. Here, the target is an infant face that is

detected from the previous frame, whereas the search

region is obtained through cropping the next frame.

The center of the search region is determined by the

target position, while its size is pre-deﬁned accord-

ing to numerous experiments. To fulﬁll the input

size requirement of AlexNet, both target and search

patches are resized, and then go through their indi-

vidual convolution-layer branch to obtain the respec-

tive CNN features. In order to estimate the target in

the subsequent frame, CNN feature maps of the tar-

get and search patches are concatenated, and are then

supplied into Fully Connected (FC) layers. Finally,

the target location is computed from the output of the

last FC layer.

Probably caused by the computational efﬁciency

of AlexNet, it provides less robustness when infants

heavily rotate their head in a video sequence. In this

case, the tracker can easily lose the target. To ad-

dress the non-rigid deformation caused by rapid head

movements, a YOLOv3 face detector is added for the

potential face detection in the search region to update

the network. The bounding box from the face detec-

tor is considered as ground truth for ﬁne- tuning the

network and therefore adapting the network to the in-

fant facial deformation over time. In this work, only

the weights of the last FC layer are ﬁne-tuned to allow

online updating.

3.2 Training

3.2.1 Input of Network

To address the challenge of multiple persons pre-

sented in video sequences, an image patch for search-

ing the infant face is cropped from the current frame,

denoted as F

. The center of the cropped patch is

located at the position of the detected face from the

previous frame, denoted as F

t−1

. Assuming the de-

tected bounding box containing the face has a center

c(x, y), then the width and the height of the bounding

box are represented as w and h, respectively. Then,

in the actual frame F

, the search region is cropped

with k times on the bounding box of the previous

frame, where k is a scale factor indicating the size of

the search region. Here, the value of k is empirically

determined. In order to avoid too much background

noise encompassed in the search area, we have set

k = 2 for balancing the computational overhead and

the tracking accuracy.

3.2.2 Loss Function

The loss is computed from the output of the last FC

layer and the ground-truth bounding-box coordinates.

Here, the L1 loss is chosen, denoted by L

, because

it is easy to compute and also proved to be reliable

against outliers. The loss function L is computed by:

L =

∑

− m

∗

), (1)

where m

denotes the coordinates (x

, y

, x

, y

) of

the top-left and bottom-right corners of the predicted

bounding box by the tracking network and m

∗

rep-

resents the corresponding coordinates of the ground-

truth bounding box.

3.2.3 Training Conﬁguration

Because of the limited availability of video sequences

for training, the weights of convolutional layers are

ﬁrst trained with ImageNet data (Deng et al., 2009).

After that, the whole network is ﬁne-tuned with our

training video sequences containing infants, in which

the weights of the FC layers are initialized by sam-

pling from a Gaussian distribution with zero mean and

standard deviation of 0.01. Furthermore, a Stochastic

Gradient Descent (SGD) is used for optimizing the

network parameters with a base learning rate of 0.001

and a momentum of 0.9. The ﬁnal FC layer outputs

the coordinates of the bounding box predicted by the

tracking network.

Dual CNN-based Face Tracking Algorithm for an Automated Infant Monitoring System

883

3.2.4 On-line Updating Process

In order to address the deformations of infant faces

over time, an online adaptation technique is utilized.

To maximally reduce the computational requirement,

only the weights of the last FC layer are updated. The

loss function used for updating the weights is identi-

cal to the speciﬁcation of Eq. (1), where only m

∗

rep-

resents the coordinates of the bounding box detected

from the YOLO tiny network in this stage.

4 EXPERIMENTS

In this section, we ﬁrst introduce the databases used

for training and testing for the proposed tracking

method. Then we describe the applied evaluation

metrics. Finally, the experimental results of the infant

face tracking are provided.

4.1 Database

The dataset used for this work contains 73 video

sequences in total, which can be divided into two

subsets. One subset contains 11 video sequences

that were recorded at the Maxima Medical Center

(MMC), Veldhoven, the Netherlands, whereas the

other subset consists of 62 video sequences that were

collected from YouTube. Video sequences recorded

at the MMC comply with the ethical standards, and

allow usage for experiments after obtaining a written

consent. In this subset, each video sequence lasts at

least 2 minutes for meeting the requirements for infant

pain assessment. Furthermore, the length of video se-

quences in the YouTube subset lasts from 20 seconds

to 2 minutes.

For training the tracking network, 51 video se-

quences are selected which maximally encompass a

variety of challenging situations, such as large head

poses and object occlusions. In the training video

dataset, 8 videos are selected from the MMC sub-

set and 43 videos are taken from the YouTube subset.

The rest of the video sequences are then used as test-

ing datasets. In order to thoroughly compare the pro-

posed tracking network with state-of-the-art methods,

the testing dataset also contains the challenging cases

as earlier mentioned. Table 1 provides the number of

frames for the example video sequences in the testing

dataset.

4.2 Metrics

In order to evaluate the tracking accuracy, two metrics

from the Online Tracking Benchmark (OTB) (Nam

Table 1: Number of frames for the example video sequences

for the testing dataset used for evaluation of face tracking.

Video Id No. FRM Video Id No. FRM

MMC #001 1,175 Y-T #001 329

MMC #002 1,139 Y-T #002 1,108

MMC #003 1,752 Y-T #003 1,764

and Han, 2016) are used, which are the success plot

and the precision plot. These two metrics are speciﬁ-

cally designed for evaluating and benchmarking the

overall performance of different tracking methods.

According to the deﬁnition, a success plot presents the

successful tracking rate in terms of overlap deﬁned

by the Intersection over Union (IoU) ratios between

the detected and ground-truth bounding boxes. Alter-

natively, the precision plot provides the precision of

tracking methods in terms of the difference between

the center locations of the tracked bounding box and

the ground-truth bounding box, which is indicated as

a tracking error. In addition to the tracking accuracy,

the execution speed is also compared and expressed as

frame throughput rate (fps), for indicating the possi-

bility of implementing the tracking methods in a real-

time system.

4.3 Experimental Results

4.3.1 Tracking Accuracy

In this experiment, both conventional tracking meth-

ods (Grabner et al., 2008) (Kalal et al., 2010) (Hen-

riques et al., 2014) (Kalal et al., 2010) and state-

of-the-art deep learning-based tracking methods (Li

et al., 2018b) (Held et al., 2016), are evaluated and

compared with our proposed approach. When apply-

ing the conventional methods to the testing video se-

quences, the tracking target is initialized every 100

frames regardless of the outcome of the tracker, since

they are less robust to situations when tracking targets

suddenly and signiﬁcantly deviate from their previous

position. Instead, for the deep learning-based method,

only the ﬁrst frame of each sequence is initialized and

then tracking is sustained till the end of the sequence.

Fig. 3 shows the success plot and the precision

plot for all the utilized tracking methods, evalu-

ated with our infant test video sequences. It can

be observed that the correlation ﬁlter-based tracker

(KCF) (Henriques et al., 2014) outperforms other

conventional tracking methods (MIL, Boosting and

MedianFlow) (Babenko et al., 2009) (Grabner et al.,

2008) (Kalal et al., 2010), due to its online up-

dating process of the target template. However,

the CNN-based tracking methods (GOTURN and

SiamRPN) (Held et al., 2016) (Li et al., 2018b) out-

VISAPP 2021 - 16th International Conference on Computer Vision Theory and Applications

884

(a) (b)

Figure 3: Performance for all the tracking methods used for comparison evaluated with the test infant datasets. (a) Precision

plot; (b) Success plot.

Table 2: Area under curve (AUC) of precision plot and suc-

cess plot evaluated with the testing infant datasets for all the

tracking methods. Boldface indicates the highest number.

Tracking methods Precision Success

Online-GOTURN 0.979 0.918

GOTURN 0.969 0.800

SiamRPN 0.927 0.765

YOLOtiny 0.838 0.640

Boosting 0.817 0.654

MIL 0.847 0.682

KCF 0.841 0.680

MedianFlow 0.768 0.630

performs the evaluated conventional methods with a

higher accuracy. It should be noted that the proposed

on-line updating tracking method achieves the high-

est performance evaluated with both the success plot

and the precision plot. Furthermore, Table 2 shows

the Area Under the Curve (AUC) values for both suc-

cess plot and precision plot. It can be readily noticed

that the proposed tracking method achieves an AUC

of 97.9% and an AUC of 91.8% for success plot and

precision plot, respectively, which performs the best

compared with all the evaluated trackers for the in-

fant face tracking application. Particularly, the pro-

posed method is compared with both the GOTURN

and the YOLO tiny face detector, where the proposed

combined tracking outperforms any of its individual

components. Fig. 4 portrays the tracking results ob-

tained by different tracking approaches when being

used for infant face tracking.

When implementing different tracking methods

for evaluation, we have noticed that KCF and MIL

can only be reliable when an infant face is frontal and

static. Unfortunately, when the infant heads signiﬁ-

cantly deviate from the frontal view, such as with a

head rotation from frontal to proﬁle view, these track-

ers are more likely to fail, due to the facial texture

changes. This failure can be explained by the aspect

that their tracking updating solution is less descrip-

tive for the texture changes between frontal and pro-

ﬁle views. However, our proposed on-line updating

tracking network is pre-trained with various views.

Furthermore, the on-line network updating technique

also allows it to model the deformations of the fa-

cial appearance between frames, which explains the

high performance of the proposed tracking method.

As a result, the high accuracy for the tracking of faces

makes the proposed tracking method suitable for inte-

grating it in a face-related human-machine-interaction

system.

4.3.2 Execution Speed

In addition to the tracking accuracy, the execution

speed is also evaluated. Table 3 presents the exe-

cution speed of each tracker of interest. It is clear

that KCF has the lowest speed as 11 fps, and Me-

dianFlow achieves the highest as 128 fps. Even

Table 3: Execution speed for each tracker operated on an

Xeon(R) CPU E5-2609 v2 @ 2.50 GHz with octal CPU

core.

Tracking method Execution speed (fps)

Boosting 27

MIL 21

KCF 11

MedianFlow 128

SiamsRPN 30

GOTURN 50

Online updated tracking 13

though the speed of MedianFlow is appealing, it lacks

the robustness against large head rotations and oc-

Dual CNN-based Face Tracking Algorithm for an Automated Infant Monitoring System

885

Frame #1 Frame #50 Frame #100 Frame #150 Frame #1 Frame #50 Frame #100 Frame #150

Figure 4: Examples of tracking results on frames within sequences from the testing dataset obtained by three CNN-based

tracking methods. Yellow boxes represent the ground truth, green boxes indicate tracking results by GOTURN, blue boxes

show the outputs from SiamsRPN and red boxes present the results obtained by our proposed method.

clusions. Compared with the conventional methods,

the CNN-based trackers generally require more com-

putation time, due to their high model complexity.

However, the CNN-based trackers are demonstrated

to provide higher robustness on discontinued texture

changes and occlusions compared to conventional

methods. Therefore, the frequency of target initializa-

tion is much lower than with conventional methods.

As shown, the speed of the proposed on-line updating

tracking is slower than SiamsRPN and GOTURN, due

to the overhead computation of online weight-factor

updating and the parallel structure of the YOLO tiny

face detection. Therefore, the proposed tracker ob-

tains high tracking accuracy at the cost of a lower

execution speed. Nevertheless, the speed of the pro-

posed tracking method is still feasible for implement-

ing an infant monitoring system. Moreover, the con-

volutional layers in the tracking network can also be

shared with the expression detection when tracking

and detection are trained in an end-to-end fashion. It

is assumed that this reduces the computation overhead

induced by the additional face detector.

5 CONCLUSIONS

In this paper, we have proposed an on-line updating

tracking method aiming at infant face tracking, based

on combining the architecture of the GOTURN and

a YOLO tiny face detector. The proposed solution

can also be reused for other human-machine inter-

action applications. The tracking position of an ob-

ject is obtained by analyzing two neighboring frames

through a deep neural network, where the network is

on-line updated by comparing the tracking output and

a detection provided by the YOLO tiny face detec-

tor. The experimental results have shown that our pro-

posed tracker achieves an AUC of 97.9% for precision

plot and an AUC of 91.8% for success plot, which

outperforms not only its single components, but also

other state-of-the-art tracking methods when used in

this application. Although the execution speed is less

competitive than other evaluated trackers, the overall

computation complexity can be further reduced when

using this tracker in an infant monitoring system or

other face-related applications, since the parameters

of the network can be shared for other detection pur-

poses. For future work, we will integrate this track-

ing method in an facial surveillance application, and

jointly train the face tracking and recognition in one

framework.

REFERENCES

Babenko, B., Yang, M.-H., and Belongie, S. (2009). Vi-

sual tracking with online multiple instance learning.

In 2009 Proc. IEEE Conf. Comput. Vis. Pattern Recog-

nit., pages 983–990. IEEE.

Bolme, D. S., Beveridge, J. R., Draper, B. A., and Lui, Y. M.

(2010). Visual object tracking using adaptive correla-

tion ﬁlters. In 2010 Proc. IEEE Conf. Comput. Vis.

Pattern Recognit., pages 2544–2550.

Comaniciu, D. and Meer, P. (2002). Mean shift: A robust

approach toward feature space analysis. IEEE Trans-

actions on Pattern Analysis & Machine Intelligence,

(5):603–619.

Danelljan, M., Bhat, G., Khan, F. S., and Felsberg, M.

(2016a). ECO: efﬁcient convolution operators for

tracking. CoRR, abs/1611.09224.

Danelljan, M., Robinson, A., Shahbaz Khan, F., and Fels-

berg, M. (2016b). Beyond correlation ﬁlters: Learning

continuous convolution operators for visual tracking.

In ECCV.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-

Fei, L. (2009). ImageNet: A Large-Scale Hierarchical

Image Database. In CVPR09.

Grabner, H., Leistner, C., and Bischof, H. (2008). Semi-

supervised on-line boosting for robust tracking. In

ECCV, pages 234–247. Springer.

Held, D., Thrun, S., and Savarese, S. (2016). Learning to

track at 100 fps with deep regression networks. In

ECCV.

Henriques, J. F., Caseiro, R., Martins, P., and Batista, J.

(2014). High-speed tracking with kernelized correla-

tion ﬁlters. IEEE Trans. Pattern Anal. Mach. Intel,

37(3):583–596.

VISAPP 2021 - 16th International Conference on Computer Vision Theory and Applications

886

Kalal, Z., Mikolajczyk, K., and Matas, J. (2010). Face-tld:

Tracking-learning-detection applied to faces. In 2010

IEEE International Conference on Image Processing,

pages 3789–3792. IEEE.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-

agenet classiﬁcation with deep convolutional neural

networks. In Advances in neural information process-

ing systems, pages 1097–1105.

Li, B., Yan, J., Wu, W., Zhu, Z., and Hu, X. (2018a). High

performance visual tracking with siamese region pro-

posal network. In Proc. IEEE Conf. Comput. Vis. Pat-

tern Recognit., pages 8971–8980.

Li, B., Yan, J., Wu, W., Zhu, Z., and Hu, X. (2018b). High

performance visual tracking with siamese region pro-

posal network. In Proc. IEEE Conf. Comput. Vis. Pat-

tern Recognit.

Nam, H. and Han, B. (2016). Learning multi-domain con-

volutional neural networks for visual tracking. In

2016 Proc. IEEE Conf. Comput. Vis. Pattern Recog-

nit., pages 4293–4302.

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.

(2016). You only look once: Uniﬁed, real-time object

detection. In Proc. IEEE Conf. Comput. Vis. Pattern

Recognit., pages 779–788.

Sun, Y., Shan, C., Tan, T., Long, X., Pourtaherian, A.,

Zinger, S., and de With, P. H. N. (2018). Video-based

discomfort detection for infants. Mach. Vis. Appl.

Van Der Merwe, R., Doucet, A., De Freitas, N., and

Wan, E. A. (2001). The unscented particle ﬁlter. In

Advances in neural information processing systems,

pages 584–590.

Welch, G., Bishop, G., et al. (1995). An introduction to the

kalman ﬁlter.

Yun, S., Choi, J., Yoo, Y., Yun, K., and Choi, J. Y.

(2017). Action-decision networks for visual tracking

with deep reinforcement learning. In 2017 Proc. IEEE

Conf. Comput. Vis. Pattern Recognit., pages 1349–

1358.

Zamzmi, G., Pai, C., Goldgof, D., Kasturi, R., Ashmeade,

T., and Sun, Y. (2016). An approach for automated

multimodal analysis of infants’ pain. In 2016 23rd Int.

Conf. Pattern Recognit. (ICPR), pages 4148–4153.

Dual CNN-based Face Tracking Algorithm for an Automated Infant Monitoring System

887