To address these challenges, this paper proposes an
online-updated face tracking algorithm intended for
integration into an automated infant monitoring
system. The algorithm combines a GOTURN tracker
(Held et al., 2016) with a pre-trained YOLO infant
face detector (Redmon et al., 2016). GOTURN is
adopted in preference to state-of-the-art tracking
methods based on reinforcement learning (Yun et al.,
2017) because of its simplicity and low computational
requirements. This paper makes the following
contributions.
1. The proposed tracker applies an online-updating
technique that combines GOTURN and the YOLO tiny
face detector. In the infant monitoring application,
this novel combination is shown to outperform each
of its individual components as well as other
state-of-the-art trackers.
2. To validate the performance of the proposed
tracker thoroughly, it is evaluated on a clinical
dataset captured at a local hospital and on a
consumer-oriented dataset collected from YouTube.
3. The proposed system achieves an execution speed
comparable to that of state-of-the-art tracking
methods based on correlation filtering.
This paper is organized as follows. Section 2 intro-
duces related work on several state-of-the-art track-
ing methods. Section 3 describes the proposed infant
face tracking algorithm. The tracking accuracy and
computation costs are evaluated in Section 4. Finally,
conclusions are presented in Section 5.
2 RELATED WORK
For decades, researchers have paid significant atten-
tion to object and face tracking. Tracking algo-
rithms can be categorized into conventional meth-
ods based on template matching (Comaniciu and
Meer, 2002), Bayesian inference (Van Der Merwe
et al., 2001) (Welch et al., 1995), correlation filter-
based tracking (Bolme et al., 2010) (Danelljan et al.,
2016a) (Danelljan et al., 2016b) (Henriques et al.,
2014), and CNN-based tracking (Nam and Han,
2016) (Li et al., 2018a) (Yun et al., 2017).
Conventional methods based on template matching
usually search for objects heuristically according to
pre-defined templates. However, it has been
demonstrated that such techniques only perform well
for a tracking target that can be represented by a
simple feature model (such as a ball with a uniform
color). Moreover, these trackers are likely to fail
when obstacles appear near the tracking target, which
hampers their application to complex tracking tasks.
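To make the template-matching idea concrete, the following is a minimal sketch that searches a grayscale image exhaustively for the window with the smallest sum of squared differences (SSD) to a fixed template. It is an illustrative stand-in, not the method of any cited work; the function name `match_template_ssd` and the SSD criterion are assumptions made for this example.

```python
import numpy as np


def match_template_ssd(image, template):
    """Return ((row, col), ssd) of the window that best matches the template.

    Hypothetical helper: both inputs are 2-D grayscale arrays and the
    template must not be larger than the image.
    """
    ih, iw = image.shape
    th, tw = template.shape
    best_pos, best_ssd = None, float("inf")
    # Slide the template over every valid top-left position.
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            window = image[r:r + th, c:c + tw]
            ssd = float(np.sum((window - template) ** 2))
            if ssd < best_ssd:
                best_pos, best_ssd = (r, c), ssd
    return best_pos, best_ssd
```

Because the template is fixed, any change in the target's appearance or a similar-looking object nearby can shift the minimum to the wrong window, which is the weakness described above.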
Tracking methods based on correlation filtering
have become prevalent because they address the
drawbacks of template-matching methods. They apply a
Fourier Transform (FT) to both the target template
(object of interest) and the search regions. In
(Danelljan et al., 2016b), the target template and
the search regions are represented by layers of
features, for example taken from one or more CNNs.
A confidence map is then calculated from the FTs of
the template and the search regions with a
convolutional operation. Finally, the target location
is taken as the position in the search area with the
highest score in the confidence map. When tracking
succeeds in the current frame, the target template is
updated with the current detection. However, this
type of tracker lacks robustness when tracking
deformable objects and is therefore less suited for
infant face tracking.
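The confidence-map computation can be sketched as follows. This is a simplified single-channel illustration, not the method of the cited works: it computes a plain circular cross-correlation through the FT and takes the peak as the target position. The function names (`confidence_map`, `locate_target`, `update_template`) are hypothetical, and the sketch assumes raw intensities rather than CNN feature layers.

```python
import numpy as np


def confidence_map(template, search):
    """Circular cross-correlation of template and search region via the FT."""
    # Zero-pad the template so both transforms have the same shape.
    padded = np.zeros(search.shape, dtype=float)
    th, tw = template.shape
    padded[:th, :tw] = template
    # Correlation theorem: correlation in the spatial domain equals an
    # element-wise product with the complex conjugate in the Fourier domain.
    spectrum = np.fft.fft2(search) * np.conj(np.fft.fft2(padded))
    return np.real(np.fft.ifft2(spectrum))


def locate_target(template, search):
    """Return (row, col, score) of the confidence-map peak.

    (row, col) is the top-left corner of the best-matching window.
    """
    scores = confidence_map(template, search)
    row, col = np.unravel_index(np.argmax(scores), scores.shape)
    return int(row), int(col), float(scores[row, col])


def update_template(search, row, col, shape):
    """Crop a new template at the detected position (no wrap-around)."""
    th, tw = shape
    return search[row:row + th, col:col + tw].copy()
```

Practical correlation-filter trackers learn a filter from many samples and operate on multi-channel features; those steps are omitted here. The template update in the last function also shows why deformation is a problem: the template follows the target only one frame at a time.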
Tracking methods based on CNN features have
shown great success in recent years (Held et al.,
2016) (Nam and Han, 2016). The outstanding per-
formance of CNN-based tracking is explained by the
strong representation ability of CNN features. These
methods can be divided into two categories: (a) track-
ing in an off-line mode (Held et al., 2016) (Li et al.,
2018a) and (b) tracking with an on-line updating
scheme (Nam and Han, 2016) (Yun et al., 2017).
The framework of these CNN-based trackers accepts
a target template and search regions as inputs of their
CNN architectures. Compared to conventional tracking
methods and correlation filter-based trackers,
CNN-based trackers locate the object by regression on
CNN features instead of by optimization. However, the
aforementioned online-updated CNN trackers are
designed to be generic across object types and lack
specificity for infant faces, which makes them
difficult to use directly in an infant monitoring
system. This paper proposes a CNN-based
online-updating tracking method that specifically
targets infant faces and combines GOTURN with the
YOLO tiny face detector. The CNN features used for
tracking can also be shared with infant expression
analysis, and are therefore compatible with a general
infant monitoring system and other generic
human-machine interaction applications.
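As a rough illustration of how a frame-to-frame tracker and a face detector can be combined, the sketch below re-initializes the tracked box from the detector whenever the two disagree. It shows one common pattern only and is not necessarily the update scheme of the proposed method. The callables `track_step` and `detect_face`, the `(x1, y1, x2, y2)` box format, and the IoU threshold of 0.5 are all assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track_with_detector(frames, init_box, track_step, detect_face,
                        iou_threshold=0.5):
    """Track a face through frames, correcting drift with a detector.

    track_step(prev_frame, frame, box) -> box   (hypothetical tracker call)
    detect_face(frame) -> box or None           (hypothetical detector call)
    """
    box = init_box
    boxes = [box]
    for prev_frame, frame in zip(frames, frames[1:]):
        box = track_step(prev_frame, frame, box)
        detection = detect_face(frame)
        # If the detector finds a face that overlaps poorly with the
        # tracked box, assume the tracker drifted and re-initialize.
        if detection is not None and iou(box, detection) < iou_threshold:
            box = detection
        boxes.append(box)
    return boxes
```

Passing the tracker and detector as callables keeps the sketch independent of any particular GOTURN or YOLO implementation.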
3 SYSTEM DESIGN
This section discusses the architecture of the proposed
online-updating tracking method and the training pro-
cedure in more detail.
VISAPP 2021 - 16th International Conference on Computer Vision Theory and Applications