images by again deriving embeddings and comparing
them to previously stored templates and each other.
This is inefficient on two levels: First, requiring multiple (from a holistic system point of view, redundant) similarity calculations would both hurt provider diversification and hinder small providers in serving larger numbers of users because of increased hardware requirements. Ideally, one could combine different aspects of these multiple embeddings extracted from face images with as little data as possible. Since research on still-image face recognition is quite extensive, and an embedded camera sensor device can often derive embeddings of the currently visible person on-line, creating a new, aggregated embedding based on all available images of an individual would not change the backbone of state-of-the-art face recognition pipelines. Second, having a single (aggregated) embedding, and thus not depending on multiple similarity computations, minimizes network traffic, which is especially significant for decentralized, embedded systems.
In this paper, our focus is on evaluating different methods of aggregating face embeddings (Section 3) from an efficiency and accuracy point of view. Furthermore, we test how many images are sufficient and analyze whether there is a clear point beyond which additional images do not significantly increase face recognition accuracy. Last but not least, in order to verify whether using multiple images in different settings boosts accuracy significantly, we propose a new in-the-wild dataset in which subjects take around 50 images of themselves in a single setting, which takes only around 3 seconds and thus remains practically usable. Additional images in radically different settings are used as an approximation of the true embedding to verify the performance improvements.
2 MULTI-IMAGE FACE RECOGNITION
In order to evaluate and compare different face recognition methods, they are tested against public datasets. Many of these face recognition datasets are of high quality (to facilitate training face recognition models) and large (to reduce bias). Most datasets define a fixed set of image pairs to allow for an objective evaluation of face recognition methods. With this strategy, a single image is used as the template in state-of-the-art face recognition pipelines. This template is then compared with positive (same person) and negative (different person) matches. This approach tests one important aspect of face recognition: how well it performs on still images. Compared to more complex scenarios, testing only on still images is efficient at runtime, which decreases the computation time needed to evaluate accuracy on a dataset. However, there are aspects this method does not test, such as how to handle multiple images or even video streams of a person.
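As an illustration of this pair-based protocol, the following sketch verifies a probe embedding against a stored template via cosine similarity. The toy embeddings, their dimensionality, and the threshold of 0.5 are illustrative assumptions only, not values taken from any particular dataset or pipeline.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two face embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_pair(template: np.ndarray, probe: np.ndarray,
                threshold: float = 0.5) -> bool:
    """Decide whether the probe shows the same person as the template."""
    return cosine_similarity(template, probe) >= threshold

# Toy 4-dimensional embeddings; real pipelines typically use hundreds
# of dimensions per embedding.
template = np.array([1.0, 0.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0, 0.0])   # same person, slight variation
negative = np.array([0.0, 1.0, 0.0, 0.0])   # different person

assert verify_pair(template, positive)      # positive pair accepted
assert not verify_pair(template, negative)  # negative pair rejected
```

Evaluating a fixed pair list then reduces to counting how many positive pairs fall above and how many negative pairs fall below the chosen threshold.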
In reality, these ignored aspects are essential, as live cameras do not produce high-quality images comparable to those in many available and commonly used face recognition datasets. Instead, the person-camera angle is far from optimal, and the person is not directly in front of the camera, so the face appears quite small. Furthermore, the face can be occluded, e.g., by a scarf, sunglasses, or hair. In these real-world settings, face recognition pipelines have a harder time recognizing people than on public datasets, although newer datasets try to represent these challenges. Nevertheless, real-world scenarios offer a potential benefit: many images of a single person are available, as the person is presumably visible for (at least) many seconds, and thus a camera is able to capture significantly more than one image.
One way of bridging the gap between having multiple images of the same person and their lower quality is to merge the embeddings obtained from multiple images into a single embedding. More accurate templates, by definition, lead to accuracy improvements in face recognition. The idea behind using multiple images is that it is not possible to capture a perfect representation of a face in a single picture, for various reasons (occlusion, lighting, accessories, ...). While a single image cannot account for all these different settings, multiple images can capture different face areas and settings. Using multiple images therefore provides more information about the individual's face, and we expect an increased accuracy. As introduced in Section 1, due to hardware and network constraints, comparing the current live image with multiple embeddings of the same person is not favorable in some situations. For an efficient face recognition pipeline, it would be best to have only a single embedding that is used as the template for a person. This would allow the system to make use of the vast literature on single-image face recognition.
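A minimal sketch of such a merge, assuming embeddings are compared with cosine similarity, is the L2-normalized mean of the per-image embeddings. This is only one possible aggregation strategy, shown here for illustration, not necessarily the method evaluated later.

```python
import numpy as np

def aggregate_embeddings(embeddings: np.ndarray) -> np.ndarray:
    """Merge N per-image embeddings (shape N x D) into one template.

    Averaging followed by re-normalization keeps the template on the
    unit hypersphere, so it can be compared with cosine similarity
    exactly like a single-image embedding.
    """
    mean = embeddings.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Three noisy embeddings of the same (toy) identity, perturbed around
# an assumed "true" direction.
rng = np.random.default_rng(0)
true_direction = np.array([1.0, 0.0, 0.0, 0.0])
noisy = np.stack([true_direction + 0.1 * rng.standard_normal(4)
                  for _ in range(3)])
template = aggregate_embeddings(noisy)

assert np.isclose(np.linalg.norm(template), 1.0)  # unit-length template
assert float(template @ true_direction) > 0.9     # near the true direction
```

Because the result is a single unit-length vector, storing or transmitting the template costs the same as one embedding, and matching a live image against it requires exactly one similarity computation.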
In contrast to this single-embedding approach, in recent years other work has been published in the domain of video face recognition (Rao et al., 2017; Rivero-Hernández et al., 2021; Zheng et al., 2020; Liu et al., 2019; Gong et al., 2019). Most of these papers propose an additional neural network to perform the weighting of different embeddings (Liu et al., 2019; Rivero-Hernández et al., 2021; Yang et al., 2017; Gong et al., 2019). Especially on embedded devices,
ICISSP 2023 - 9th International Conference on Information Systems Security and Privacy