Table 1: Results of tests with and without augmentation using the described probabilistic model.

Method       R@1    R@5
CLIP         0.140  0.306
Faces        0.069  0.120
CLIP+Faces   0.218  0.396
5 EXPERIMENTS
5.1 Efficacy of Prototype
An experiment was run to determine whether augmenting CLIP with face recognition according to the described method improves retrieval results. Two tests were run on the same sample of 1000 randomly selected images and their descriptions from 2020. In the first test only CLIP was used; in the second, CLIP was augmented with the proposed probabilistic model. Face-shot clusters and probability estimates were obtained from images and descriptions from 2019. Results are shown in Table 1.
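R@1 and R@5 are recall-at-k values over the test queries: the fraction of description queries for which the relevant image appears among the top k retrieved results. The following minimal sketch shows how such figures can be computed from a description-to-image similarity matrix; the layout in which query i's relevant image is candidate i is an assumption for illustration only, not the evaluation code used in this work.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose relevant image is among the top-k results.

    similarity: (n_queries, n_images) score matrix; the relevant image for
    query i is assumed to be candidate i (illustrative convention).
    """
    # Rank candidate images for each query by descending similarity score.
    ranking = np.argsort(-similarity, axis=1)
    correct = np.arange(similarity.shape[0])
    # A query counts as a hit if its relevant image appears in the top k ranks.
    hits = (ranking[:, :k] == correct[:, None]).any(axis=1)
    return hits.mean()

# Example with random scores for 1000 queries against 1000 candidate images.
scores = np.random.rand(1000, 1000)
print(recall_at_k(scores, 1), recall_at_k(scores, 5))
```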
6 CONCLUSION AND FUTURE WORK
The face recognition prototype achieved 21.8% R@1, compared to 14.0% for CLIP alone, which shows that the described method of CLIP augmentation is effective.
The method of CLIP augmentation can likely be applied to other types of objects as well, such as company logos, buildings, or other entities that are unlikely to have appeared in the CLIP training data.
However, there is much room for future work. The Boolean model of whether names and faces are detected likely discards useful information. Modifying the model to take into account the distance to face clusters is one potential avenue for improvement.
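One way such distance information could enter the model is sketched below, assuming face embeddings and per-person cluster centroids are available; the function name, the exponential form, and the scale parameter are illustrative assumptions, not part of the described model, which uses a hard Boolean indicator.

```python
import numpy as np

def soft_face_match(embedding, cluster_centroids, scale=0.6):
    """Turn distance to each person's face cluster into a soft match score.

    embedding: (d,) face embedding of a detected face.
    cluster_centroids: (n_people, d) centroid of each person's face-shot cluster.
    Returns a score in (0, 1] per person; exp(-d / scale) is one illustrative
    choice that could replace the Boolean 'face detected' indicator.
    """
    distances = np.linalg.norm(cluster_centroids - embedding, axis=1)
    return np.exp(-distances / scale)
```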
Another possibility for future work is to relax some of the independence assumptions. For instance, one could cluster people who are more likely to appear together and use that to improve the estimate of whether an image is relevant.
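As a simple illustration of where such co-occurrence information could come from, the sketch below counts how often pairs of recognised names appear in the same description; the input format and helper name are hypothetical, and how the counts would feed into the relevance estimate is left open here.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(caption_names):
    """Count how often pairs of recognised names share an image description.

    caption_names: iterable of name lists, one list per description
    (hypothetical input format). The counts could inform a joint, rather
    than independent, estimate of which faces to expect together.
    """
    pair_counts = Counter()
    for names in caption_names:
        for a, b in combinations(sorted(set(names)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Toy example with two captions.
print(cooccurrence_counts([["A. Berzins", "B. Kalnina"],
                           ["A. Berzins", "B. Kalnina", "C. Ozols"]]))
```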
7 NOTE ON RESPONSIBLE USE
The authors used face recognition only for the purpose of enhancing image search of public figures in LETA’s internal image database for journalists. The authors strongly advise against its use in cases where it could undermine people’s privacy.
ACKNOWLEDGEMENTS
The research was supported by ERDF project
1.1.1.1/18/A/045 at IMCS, University of Latvia.
This research is funded by the Latvian Council of
Science, project No. lzp-2021/1-0479.