Identification of MIR-Flickr Near-duplicate Images - A Benchmark Collection for Near-duplicate Detection

Richard Connor, Stewart MacKenzie-Leigh, Franco Alberto Cardillo, Robert Moss

Abstract

There are many contexts where the automated detection of near-duplicate images is important, for example the detection of copyright infringement or images of child abuse. Many methods for detecting similar and near-duplicate images have been published; however, it is still uncommon for these methods to be compared objectively with each other, probably because no good framework exists in which to do so. Published sets of near-duplicate images exist, but are typically small, specialist, or generated. Here, we give a new test set based on a large, serendipitously selected collection of high quality images. Having observed that the MIR-Flickr 1M image set contains a significant number of near-duplicate images, we have discovered the majority of these. We disclose a set of 1,958 near-duplicate clusters from within the set, and show that this is very likely to contain almost all of the near-duplicate pairs that exist. The main contribution of this publication is the identification of these images, which may then be used by other authors to make comparisons as they see fit. In particular, near-duplicate classification functions may now be accurately tested for sensitivity and specificity over a general collection of images.
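As a rough illustration of the kind of test the abstract describes, the sketch below scores a pairwise classifier against a set of ground-truth clusters. It is not the authors' evaluation code: the function names, the representation of clusters as sets of image ids, and the `is_near_duplicate` classifier are all assumptions made for this example. Sensitivity is taken as TP / (TP + FN) and specificity as TN / (TN + FP), counted over unordered image pairs.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Expand clusters of image ids into the set of unordered near-duplicate pairs."""
    pairs = set()
    for cluster in clusters:
        # sorted() gives each pair a canonical (smaller, larger) order
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def sensitivity_specificity(clusters, candidate_pairs, is_near_duplicate):
    """Score a pairwise classifier against ground-truth clusters.

    clusters          -- iterable of sets of image ids (assumed ground truth)
    candidate_pairs   -- iterable of (id_a, id_b) pairs to test
    is_near_duplicate -- hypothetical classifier: (id_a, id_b) -> bool
    """
    truth = cluster_pairs(clusters)
    tp = fp = tn = fn = 0
    for a, b in candidate_pairs:
        pair = (a, b) if a <= b else (b, a)
        actual = pair in truth
        predicted = is_near_duplicate(a, b)
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    # Guard against empty classes so the ratios are always defined
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Toy data: two ground-truth clusters among six image ids (illustrative only)
toy_clusters = [{1, 2, 3}, {4, 5}]
toy_pairs = list(combinations(range(1, 7), 2))


def toy_classifier(a, b):
    # Stand-in rule for the example; a real classifier would compare image features
    return abs(a - b) == 1


toy_sensitivity, toy_specificity = sensitivity_specificity(
    toy_clusters, toy_pairs, toy_classifier
)
```

On a collection of a million images, enumerating every pair is impractical, so in practice the candidate pairs would probably be drawn from a sample or from the output of a similarity search; the sketch leaves that choice to the caller.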


Paper Citation


in Harvard Style

Connor R., MacKenzie-Leigh S., Cardillo F. and Moss R. (2015). Identification of MIR-Flickr Near-duplicate Images - A Benchmark Collection for Near-duplicate Detection. In Proceedings of the 10th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2015), ISBN 978-989-758-090-1, pages 565-571. DOI: 10.5220/0005359705650571


in Bibtex Style

@conference{visapp15,
author={Richard Connor and Stewart MacKenzie-Leigh and Franco Alberto Cardillo and Robert Moss},
title={Identification of MIR-Flickr Near-duplicate Images - A Benchmark Collection for Near-duplicate Detection},
booktitle={Proceedings of the 10th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2015)},
year={2015},
pages={565-571},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005359705650571},
isbn={978-989-758-090-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2015)
TI - Identification of MIR-Flickr Near-duplicate Images - A Benchmark Collection for Near-duplicate Detection
SN - 978-989-758-090-1
AU - Connor R.
AU - MacKenzie-Leigh S.
AU - Cardillo F.
AU - Moss R.
PY - 2015
SP - 565
EP - 571
DO - 10.5220/0005359705650571