URL-based Web Page Classification - A New Method for URL-based Web Page Classification Using n-Gram Language Models

Tarek Amr Abdallah, Beatriz de la Iglesia

2014

Abstract

This paper is concerned with the classification of web pages using their Uniform Resource Locators (URLs) only. There are a number of contexts in which it is important to classify a web page efficiently and reliably from its URL, without the need to visit the page itself. For example, emails or messages sent on social media may contain URLs that require automatic classification. A URL is very concise and may be composed of concatenated words, so classification using only this information is a very challenging task. Much of the current research on URL-based classification has achieved reasonable accuracy, but the current methods do not scale well to large datasets. In this paper, we propose a new solution based on the use of an n-gram language model. Our solution shows good classification performance and is scalable to larger datasets. It also allows us to tackle the problem of classifying new URLs with unseen sub-sequences.
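
As a rough illustration of the idea in the abstract, and not the authors' implementation, the following sketch scores a URL under one character n-gram model per class and picks the class with the highest log-probability. Several things here are assumptions made for the example: the class name `NgramUrlClassifier`, the helper `char_ngrams`, the choice of n = 3, the simplification of treating each n-gram as an independent event, and add-alpha (Laplace) smoothing as the way of reserving probability mass for unseen sub-sequences. The toy URLs and labels are invented.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(url, n=3):
    """Return the overlapping character n-grams of a lowercased URL."""
    s = url.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


class NgramUrlClassifier:
    """Hypothetical per-class character n-gram model with add-alpha smoothing."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha                  # smoothing strength (assumed value)
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.totals = Counter()             # label -> total number of n-grams
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, urls, labels):
        # A single pass over the data that only accumulates counts.
        for url, label in zip(urls, labels):
            grams = char_ngrams(url, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.vocab.update(grams)
        return self

    def log_score(self, url, label):
        # One extra vocabulary slot keeps some mass for n-grams never seen.
        v = len(self.vocab) + 1
        denom = self.totals[label] + self.alpha * v
        score = 0.0
        for g in char_ngrams(url, self.n):
            score += math.log((self.counts[label][g] + self.alpha) / denom)
        return score

    def predict(self, url):
        return max(self.counts, key=lambda label: self.log_score(url, label))


# Toy training data, invented for illustration only.
train_urls = [
    "espn.com/football/scores",
    "skysports.com/football/transfers",
    "arxiv.org/abs/physics",
    "nature.com/articles/physics",
]
train_labels = ["sport", "sport", "science", "science"]

clf = NgramUrlClassifier(n=3).fit(train_urls, train_labels)
print(clf.predict("example.org/football"))
print(clf.predict("example.net/physics"))
```

Because training only accumulates counts, the cost grows linearly with the number of URLs, and smoothing keeps the score finite for URLs whose n-grams were never seen during training.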



Paper Citation


in Harvard Style

Amr Abdallah T. and de la Iglesia B. (2014). URL-based Web Page Classification - A New Method for URL-based Web Page Classification Using n-Gram Language Models. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014) ISBN 978-989-758-048-2, pages 14-21. DOI: 10.5220/0005030500140021


in Bibtex Style

@conference{kdir14,
author={Tarek Amr Abdallah and Beatriz de la Iglesia},
title={URL-based Web Page Classification - A New Method for URL-based Web Page Classification Using n-Gram Language Models},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},
year={2014},
pages={14-21},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005030500140021},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)
TI - URL-based Web Page Classification - A New Method for URL-based Web Page Classification Using n-Gram Language Models
SN - 978-989-758-048-2
AU - Amr Abdallah T.
AU - de la Iglesia B.
PY - 2014
SP - 14
EP - 21
DO - 10.5220/0005030500140021