character face angle).
These difficulties raise many challenging issues for the creation of an efficient automated character detection system. Identifying the correct character in a scene, from its very first appearance until the last frame, is not a trivial task. In an attempt to identify the presence of characters in scenes, it was necessary to test a hybrid solution based on complementary, specific techniques.
Here, the advantages, disadvantages, and results achieved with a system built on existing and complementary methods will be presented.
This paper is organized as follows. Section 2 describes previous and related work. Section 3 presents the proposed method. Section 4 shows the results. Section 5 discusses the results. Finally, Section 6 concludes the work, pointing out some future directions.
2 RELATED WORK
Face recognition is a challenging task that has been studied for more than two decades. Classical approaches for videos consider video sequences in which each frame is analyzed individually. More accurate approaches use temporal cues in addition to the appearance of the faces in videos (Zhou et al., 2018; Wang et al., 2008; Zhou et al., 2004).
An image-based face recognition system normally involves tasks that include face detection, face alignment, and face matching between a face detected in a photo or video and a dataset of N faces (Zhou et al., 2018). However, in real-world situations, illumination conditions, facial expressions, and occlusion represent challenging problems.
Google (Schroff et al., 2015a) presented a system, called FaceNet, that maps face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. These feature vectors can be used, combined with classical machine learning techniques, in tasks such as face recognition, verification, and clustering. Their method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in other architectures.
To deal with real video situations, Huang et al. (Huang et al., 2015) propose a Hybrid Euclidean-and-Riemannian Metric Learning method to fuse multiple statistics of an image set. They represent each image set simultaneously by its mean, covariance matrix, and Gaussian distribution. To test their approach, they use a public large-scale video face dataset, YouTube Celebrities, containing 1,910 video clips of 47 subjects collected from the web. Their results are impressive, although their method faces some problems in our context; for example, it does not perform person re-identification: there is no association between frames of the same person taken from different cameras, or from the same camera on different occasions.
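As a hedged illustration of the set statistics Huang et al. fuse (not of their metric learning itself), the first- and second-order representations of an image set can be computed directly; the feature matrix below is synthetic and stands in for per-frame CNN features:

```python
import numpy as np

# Synthetic image set: 20 frames of one identity, each described by a
# 64-d feature vector (random data stands in for real features).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 64))

# The statistics each set is represented by: mean and covariance
# (the Gaussian model combines both).
mean = X.mean(axis=0)            # 64-d mean vector (Euclidean part)
cov = np.cov(X, rowvar=False)    # 64x64 covariance (Riemannian part)

# With fewer frames than feature dimensions the covariance is singular;
# a small ridge keeps it positive definite, as required when treating it
# as a point on the SPD manifold.
cov_spd = cov + 1e-6 * np.eye(cov.shape[0])

print(mean.shape, cov_spd.shape)
```

The hybrid metric learning then compares means in Euclidean space and covariances/Gaussians under Riemannian distances, which is where the method's discriminative power comes from.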
Masi et al. (Masi et al., 2018) propose a method designed to explicitly tackle pose variations. Their Pose-Aware Models (PAM) process a face image using several pose-specific deep convolutional neural networks (CNNs). In addition, their application uses 3D rendering to synthesize multiple face poses from input images, both to train these models and to provide additional robustness to pose variations.
Zhou et al. (Zhou et al., 2018) address the problem of recognizing and tracking multiple faces in videos involving pose variation and occlusion. They introduce constraints of inter-frame temporal smoothness and coherence on multiple faces in videos, and model the tasks of multiple face recognition and multiple face tracking jointly in an optimization framework. They also focus on practical problems involving surveillance cameras.
Ranjan et al. (Ranjan et al., 2017) propose an algorithm for simultaneous face detection, landmark localization, pose estimation, and gender recognition using deep convolutional neural networks (CNNs). Their proposal fuses the intermediate layers of a CNN with a separate CNN, followed by a multi-task learning algorithm that operates on the fused features. Their method can detect faces, localize landmarks (points related to the face), estimate the pose (as roll, pitch, and yaw), and recognize gender, using specialized CNN layers to produce each output prediction. Despite the outstanding results, this approach still seems to be very computationally intensive.
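The pose output mentioned above is a (roll, pitch, yaw) triple. As a self-contained sketch, unrelated to Ranjan et al.'s CNN itself and assuming the common ZYX Euler-angle convention, these angles relate to a head rotation matrix as follows:

```python
import numpy as np

def rotation(yaw, pitch, roll):
    """Compose R = Rz(yaw) @ Ry(pitch) @ Rx(roll) (ZYX convention)."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx

def angles(R):
    """Recover (roll, pitch, yaw) from a ZYX rotation matrix,
    valid for pitch in (-pi/2, pi/2)."""
    pitch = -np.arcsin(R[2, 0])
    roll = np.arctan2(R[2, 1], R[2, 2])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return roll, pitch, yaw

R = rotation(0.3, 0.1, -0.2)
print(angles(R))  # recovers roll=-0.2, pitch=0.1, yaw=0.3
```

A pose-estimation network regresses these three angles directly from pixels; the round trip above only fixes the geometric meaning of its outputs.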
Gou et al. (Gou et al., 2017) use the collaborative representation method to obtain the outputs of sub-image sets of different sizes, and reach the final result by optimally combining these outputs.
While the outstanding results of these methods represent advances in tackling different face recognition problems, they still present some limitations for real-world problems and industry applications. A framework that combines the robustness of each proposal can be used to tackle the optimization problems, as well as the illumination and facial expression problems. In the next section, we will present the main aspects of our system.
Globo Face Stream: A System for Video Meta-data Generation in an Entertainment Industry Setting