A Comparative Study on Outlier Removal from a Large-scale Dataset using Unsupervised Anomaly Detection

Markus Goldstein, Seiichi Uchida

Abstract

Outlier removal from training data is a classical problem in pattern recognition. This problem has become more important for large-scale datasets for two reasons: First, there is a higher risk of “unexpected” outliers, such as mislabeled training data. Second, a large-scale dataset makes it more difficult to grasp the distribution of outliers. On the other hand, many unsupervised anomaly detection methods have been proposed, which can also be used for outlier removal. In this paper, we present a comparative study of nine different anomaly detection methods in the scenario of outlier removal from a large-scale dataset. To observe performance accurately, we need a simple and describable recognition procedure and therefore use a nearest neighbor-based classifier. As an adequate large-scale dataset, we prepared a handwritten digit dataset comprising more than 800,000 manually labeled samples. With a data dimensionality of 16×16 = 256, this ensures that each digit class has at least 100 times more instances than the data dimensionality. The experimental results show that the common understanding that outlier removal improves classification performance on small datasets does not hold for high-dimensional large-scale datasets. Additionally, it was found that local anomaly detection algorithms perform better on this data than their global equivalents.
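The pipeline the abstract describes (score training samples with an unsupervised anomaly detector, drop the highest-scoring ones, then classify with a nearest neighbor rule) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic Gaussian data, the per-class scoring, the removal fraction, the value of k, and the function names. The score used is a simple global k-NN distance, just one of many possible detectors; the abstract reports that local methods performed better on the authors' data.

```python
import numpy as np


def knn_outlier_scores(X, k=5):
    """Global k-NN anomaly score: mean distance to the k nearest neighbors."""
    # Pairwise squared Euclidean distances (O(n^2) memory, fine for a demo).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0.0, out=d2)      # clip tiny negatives caused by round-off
    np.fill_diagonal(d2, np.inf)     # a point is not its own neighbor
    nearest = np.partition(d2, k - 1, axis=1)[:, :k]
    return np.sqrt(nearest).mean(axis=1)


def remove_outliers_per_class(X, y, fraction=0.01, k=5):
    """Drop the `fraction` highest-scoring samples within each class.

    Scoring each class separately is an assumption of this sketch.
    """
    keep = np.ones(len(y), dtype=bool)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        scores = knn_outlier_scores(X[idx], k=k)
        n_drop = int(round(fraction * len(idx)))
        if n_drop > 0:
            keep[idx[np.argsort(scores)[-n_drop:]]] = False
    return X[keep], y[keep]


def predict_1nn(X_train, y_train, X_test):
    """Label each test sample with the class of its nearest training sample."""
    d2 = (np.sum(X_test ** 2, axis=1)[:, None]
          + np.sum(X_train ** 2, axis=1)[None, :]
          - 2.0 * X_test @ X_train.T)
    return y_train[np.argmin(d2, axis=1)]


# Synthetic stand-in data (hypothetical): two Gaussian classes in 256 dimensions,
# matching only the 16 x 16 = 256 dimensionality mentioned in the abstract.
rng = np.random.default_rng(0)
dim, n_per_class = 256, 200
X_train = np.vstack([rng.normal(0.0, 1.0, size=(n_per_class, dim)),
                     rng.normal(0.5, 1.0, size=(n_per_class, dim))])
y_train = np.repeat([0, 1], n_per_class)
y_train[:4] = 1                      # simulate four mislabeled training samples

X_test = np.vstack([rng.normal(0.0, 1.0, size=(50, dim)),
                    rng.normal(0.5, 1.0, size=(50, dim))])
y_test = np.repeat([0, 1], 50)

X_clean, y_clean = remove_outliers_per_class(X_train, y_train, fraction=0.05)

acc_raw = np.mean(predict_1nn(X_train, y_train, X_test) == y_test)
acc_clean = np.mean(predict_1nn(X_clean, y_clean, X_test) == y_test)
print(f"1-NN accuracy without removal: {acc_raw:.3f}")
print(f"1-NN accuracy with removal:    {acc_clean:.3f}")
```

The two printed accuracies only show how such a comparison could be set up; on toy data they say nothing about the paper's findings. At the scale the abstract mentions (more than 800,000 samples), the dense pairwise distance matrix used here would not fit in memory, so a real experiment would need a blockwise or index-based neighbor search.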


Paper Citation


in Harvard Style

Goldstein M. and Uchida S. (2016). A Comparative Study on Outlier Removal from a Large-scale Dataset using Unsupervised Anomaly Detection. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-173-1, pages 263-269. DOI: 10.5220/0005701302630269


in Bibtex Style

@conference{icpram16,
author={Markus Goldstein and Seiichi Uchida},
title={A Comparative Study on Outlier Removal from a Large-scale Dataset using Unsupervised Anomaly Detection},
booktitle={Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2016},
pages={263-269},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005701302630269},
isbn={978-989-758-173-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - A Comparative Study on Outlier Removal from a Large-scale Dataset using Unsupervised Anomaly Detection
SN - 978-989-758-173-1
AU - Goldstein M.
AU - Uchida S.
PY - 2016
SP - 263
EP - 269
DO - 10.5220/0005701302630269