Using the Cluster-based Tree Structure of k-Nearest Neighbor to Reduce the Effort Required to Classify Unlabeled Large Datasets

Elias Oliveira, Howard Roatti, Matheus de Araujo Nogueira, Henrique Gomes Basoni, Patrick Marques Ciarelli

Abstract

The usual practice in classification problems is to create a set of labeled data for training and then use it to tune a classifier that predicts the classes of the remaining items in the dataset. However, producing labeled data demands great human effort, and classification by specialists is normally expensive and time-consuming. In this paper, we discuss how a cluster-based tree kNN structure can be used to quickly build a training dataset from scratch. We evaluated the proposed method on several classification datasets, and the results are promising: the specialists' labeling work was reduced to 4% of the number of documents in the evaluated datasets. Furthermore, we achieved an average accuracy of 72.19% on the tested datasets, versus 77.12% when using 90% of the dataset for training.
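
The general idea of labeling only a few representative items and letting the remaining items inherit those labels can be illustrated with a minimal sketch. This is not the paper's cluster-based tree: as a stand-in it uses a flat, one-pass "leader" clustering, and the toy data, the `radius` value, and all function names are assumptions made up for illustration. The `oracle` callable plays the role of the human specialist.

```python
import math
import random


def leader_clusters(points, radius):
    """Greedy one-pass clustering (an assumed stand-in for the tree structure).

    A point joins the nearest existing representative within `radius`;
    otherwise it becomes a new representative itself.
    """
    reps = []        # indices of representative points
    assignment = []  # assignment[i] = index of the representative of point i
    for i, p in enumerate(points):
        best, best_d = None, radius
        for r in reps:
            d = math.dist(p, points[r])
            if d <= best_d:
                best, best_d = r, d
        if best is None:
            reps.append(i)
            best = i
        assignment.append(best)
    return reps, assignment


def propagate_labels(reps, assignment, oracle):
    """Query the oracle only for representatives, then copy each
    representative's label to every point assigned to it."""
    rep_labels = {r: oracle(r) for r in reps}
    return [rep_labels[a] for a in assignment]


def make_toy_data(n_per_class=100, seed=0):
    """Two well-separated 2-D Gaussian blobs (hypothetical data)."""
    rng = random.Random(seed)
    centers = {"a": (0.0, 0.0), "b": (10.0, 10.0)}
    points, labels = [], []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            points.append((cx + rng.gauss(0, 1), cy + rng.gauss(0, 1)))
            labels.append(label)
    return points, labels


points, truth = make_toy_data()
reps, assignment = leader_clusters(points, radius=3.0)
# The simulated specialist simply looks up the true label.
predicted = propagate_labels(reps, assignment, oracle=lambda i: truth[i])

effort = len(reps) / len(points)
accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(truth)
print(f"labeled {len(reps)} of {len(points)} points ({effort:.1%}), "
      f"accuracy {accuracy:.1%}")
```

On real data the trade-off between labeling effort and accuracy depends on how the representatives are chosen and on how well the clusters align with the classes, which is the question the paper evaluates.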



Paper Citation


in Harvard Style

Oliveira E., Roatti H., Nogueira M., Basoni H. and Ciarelli P. (2015). Using the Cluster-based Tree Structure of k-Nearest Neighbor to Reduce the Effort Required to Classify Unlabeled Large Datasets. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: SSTM, (IC3K 2015) ISBN 978-989-758-158-8, pages 567-576. DOI: 10.5220/0005615305670576


in Bibtex Style

@conference{sstm15,
author={Elias Oliveira and Howard Roatti and Matheus de Araujo Nogueira and Henrique Gomes Basoni and Patrick Marques Ciarelli},
title={Using the Cluster-based Tree Structure of k-Nearest Neighbor to Reduce the Effort Required to Classify Unlabeled Large Datasets},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: SSTM, (IC3K 2015)},
year={2015},
pages={567-576},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005615305670576},
isbn={978-989-758-158-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: SSTM, (IC3K 2015)
TI - Using the Cluster-based Tree Structure of k-Nearest Neighbor to Reduce the Effort Required to Classify Unlabeled Large Datasets
SN - 978-989-758-158-8
AU - Oliveira E.
AU - Roatti H.
AU - Nogueira M.
AU - Basoni H.
AU - Ciarelli P.
PY - 2015
SP - 567
EP - 576
DO - 10.5220/0005615305670576