Table 3: Population Estimates.
Method 1    Method 2    a     b     z    Estimate  V     98% CI
Eh/Sed      Ht/Man      1225  916   579  1938      1252  1868-2009
Eh/Sed      pHash/Ham   1225  1130  754  1837      570   1789-1884
pHash/Ham   Ht/Man      1130  916   560  1849      1190  1780-1918
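
The figures in Table 3 are consistent with a two-sample capture-recapture calculation. The following is a minimal sketch under stated assumptions, not necessarily the calculation used for the table: a and b are taken to be the numbers of clusters found by each method, z the number found by both, and the estimate and variance V are taken to follow Chapman's bias-corrected form of the Lincoln-Petersen estimator. The estimator is not named in this section, and the function name is illustrative.

```python
def chapman_estimate(a: int, b: int, z: int) -> tuple[float, float]:
    """Return (population estimate, variance) for a two-sample capture-recapture.

    a, b: number of clusters found by each method; z: number found by both.
    Assumed estimator: Chapman's form of Lincoln-Petersen (not stated in the text).
    """
    n_hat = (a + 1) * (b + 1) / (z + 1) - 1
    variance = (a + 1) * (b + 1) * (a - z) * (b - z) / ((z + 1) ** 2 * (z + 2))
    return n_hat, variance


# Rows of Table 3: (method 1, method 2, a, b, z)
rows = [
    ("Eh/Sed", "Ht/Man", 1225, 916, 579),
    ("Eh/Sed", "pHash/Ham", 1225, 1130, 754),
    ("pHash/Ham", "Ht/Man", 1130, 916, 560),
]

for m1, m2, a, b, z in rows:
    n_hat, variance = chapman_estimate(a, b, z)
    print(f"{m1:10s} {m2:10s} estimate={n_hat:7.1f} variance={variance:7.1f}")
```

Under these assumptions the computed variances round to the V column, and the computed estimates fall within two of the tabulated values; the small offset suggests the table may use a slightly different form of the estimate.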
We have identified and published a set of nearly 2,000
near-duplicate clusters which occur within the MIR-Flickr
image collection of one million images. As both the
collection and the near-duplicate subset arose through
serendipitous processes, this makes a valuable test set
for the semantic comparison of near-duplicate finding
functions. While work using the test set is still at an
early stage, we have already made some surprising
discoveries concerning the use of different metrics with
well-known image characterisation functions.
The exhaustive search for near-duplicates within the set
will of course never be finished: any updates will be
gratefully received by the authors, and communicated
onwards through our website.
We would like to acknowledge help and advice from Mark
Huiskes of Leiden University and Kenneth Pollock of North
Carolina State University, for sharing their knowledge
about the MIR-Flickr collection and population statistics,
respectively. We would also like to thank Richard Martin
and Karina Kubiak-Ossowska of the University of Strathclyde
for help with access to the ARCHIE-WeSt HPC facilities
needed for some of the analysis.
Franco Alberto Cardillo was supported by the National
Research Council of Italy (CNR) through a Short-term
Mobility Fellowship (STM), which funded a stay at the
University of Strathclyde in Glasgow (UK), where part of
this work was done.