qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations

Jingwen Wang, Jie Wang

Abstract

We present a new method, qRead, for real-time content extraction from web pages with high accuracy. Earlier approaches to content extraction rely on empirical filtering rules, Document Object Model (DOM) trees, or machine learning models. These methods have had some success, but they may not meet the combined requirements of real-time extraction and high accuracy. For example, constructing a DOM tree for a complex web page is time-consuming, and machine learning models can add unnecessary complexity. Unlike these approaches, qRead uses segment densities and similarities to identify the main content. In particular, qRead first filters out obvious junk content, eliminates HTML tags, and partitions the remaining text into natural segments. It then identifies the main content by combining two features: the ratio of the number of words to the number of lines in a segment, where the highest ratio is preferred, and the similarity between the segment and the title. Extensive experiments show that qRead achieves 96.8% accuracy on Chinese web pages with an average extraction time of 13.20 milliseconds, and 93.6% accuracy on English web pages with an average extraction time of 11.37 milliseconds. These results are a substantial improvement in accuracy over previous approaches and meet the real-time extraction requirement.
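
The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the junk filters, the segmentation rule, the similarity measure, or how the two features are combined, so everything below is an assumption made for illustration. Specifically, junk is assumed to be `script`, `style`, `nav`, and `footer` elements; segments are assumed to be runs of non-empty lines separated by blank lines; similarity is assumed to be the fraction of title words that occur in the segment; and the score is assumed to be `density * (1 + similarity)`. All function names are hypothetical, and tokenization is whitespace-oriented, so Chinese text would need a word segmenter.

```python
import re

# Assumed junk filter: whole elements that rarely contain article text.
JUNK_RE = re.compile(r"<(script|style|nav|footer)\b.*?</\1\s*>", re.I | re.S)
# Any remaining HTML tag.
TAG_RE = re.compile(r"<[^>]+>")


def tokens(text):
    """Lowercased word tokens (suitable for space-delimited languages only)."""
    return re.findall(r"\w+", text.lower())


def partition(html):
    """Filter junk, strip tags, and split the text into blank-line-separated segments."""
    text = JUNK_RE.sub("\n\n", html)   # removed junk leaves a segment break
    text = TAG_RE.sub("", text)        # drop tags but keep the line layout
    segments, current = [], []
    for line in (ln.strip() for ln in text.splitlines()):
        if line:
            current.append(line)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def density(segment):
    """Ratio of the number of words to the number of lines in a segment."""
    return sum(len(tokens(ln)) for ln in segment) / len(segment)


def title_similarity(segment, title):
    """Assumed measure: fraction of title words that also occur in the segment."""
    title_words = set(tokens(title))
    if not title_words:
        return 0.0
    segment_words = set(tokens(" ".join(segment)))
    return len(segment_words & title_words) / len(title_words)


def extract_main(html, title):
    """Return the segment with the best combined score (assumed combination)."""
    segments = partition(html)
    if not segments:
        return ""
    best = max(segments, key=lambda s: density(s) * (1.0 + title_similarity(s, title)))
    return "\n".join(best)
```

The sketch avoids building a DOM tree and makes a single regex-based pass over the page text, which is in the spirit of the speed argument above. The multiplicative score is only one plausible way to combine density with title similarity.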



Paper Citation


in Harvard Style

Wang J. and Wang J. (2015). qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015) ISBN 978-989-758-158-8, pages 364-371. DOI: 10.5220/0005613603640371


in Bibtex Style

@conference{kdir15,
author={Jingwen Wang and Jie Wang},
title={qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)},
year={2015},
pages={364-371},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005613603640371},
isbn={978-989-758-158-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)
TI - qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations
SN - 978-989-758-158-8
AU - Wang J.
AU - Wang J.
PY - 2015
SP - 364
EP - 371
DO - 10.5220/0005613603640371