EXTRACTING MAIN CONTENT-BLOCKS FROM BLOG POSTS

Saiful Akbar, Laura Slaughter, Øystein Nytrø

Abstract

A blog post typically contains distinct blocks carrying different kinds of information, such as the main content, a blogger profile, links to blog archives, comments, and even advertisements. Identifying and extracting the main content block of a blog post, or of a web page in general, is therefore an important information-extraction step before further processing. This paper describes our approach to extracting the main content block from blog posts with disparate types of blog mark-up. Adapting the Content Structure Tree (CST)-based approach, our approach introduces a new consideration in calculating the importance of HTML content nodes and in defining the attenuation quotient suffered by HTML item/block nodes. Performance improves because posts published in the same domain tend to share a similar page template, so a general main-content marker can be applied to all of them. The approach consists of two steps. In the first step, it employs the modified CST approach to detect the primary and secondary markers for a page cluster. In the second step, it uses HTMLFilter to extract the main block of a page based on the detected markers. When HTMLFilter cannot find the main block, the modified CST is used as a fallback. Experiments showed that the approach can extract the main block with an accuracy of more than 94%.
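
The two-step flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the modified CST scoring or the HTMLFilter interface, so a simple text-length heuristic stands in for the node-importance calculation. A "marker" is assumed here to be a (tag, id-or-class) pair, and all function and class names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # elements with no end tag


class BlockCollector(HTMLParser):
    """Credits each text run to its nearest ancestor that has an id or class."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, marker or None) for every open element
        self.texts = {}   # marker -> list of text runs owned by that block

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        key = attr.get("id") or attr.get("class")  # a multi-class value is kept as one string
        self.stack.append((tag, (tag, key) if key else None))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in ("script", "style") for t, _ in self.stack):
            return
        for _, marker in reversed(self.stack):
            if marker:
                self.texts.setdefault(marker, []).append(text)
                break


def blocks(html):
    """Return {marker: text} for every marked block in the page."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return {m: " ".join(runs) for m, runs in parser.texts.items()}


def best_marker(html):
    """Stand-in for the CST importance score: the block owning the most text."""
    found = blocks(html)
    return max(found, key=lambda m: len(found[m])) if found else None


def detect_markers(pages):
    """Step 1: the two most frequent best markers across a page cluster."""
    counts = Counter(m for m in map(best_marker, pages) if m)
    return [m for m, _ in counts.most_common(2)]  # [primary, secondary]


def extract_main(html, markers):
    """Step 2: try the cluster markers first, else fall back to per-page scoring."""
    found = blocks(html)
    for marker in markers:
        if marker in found:
            return found[marker]
    return found.get(best_marker(html), "")
```

Calling detect_markers on a set of pages from one blog domain and then extract_main on each page mirrors the marker-first, scoring-as-fallback order of the abstract. A real system would replace best_marker with the paper's importance and attenuation calculation.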



Paper Citation


in Harvard Style

Akbar S., Slaughter L. and Nytrø Ø. (2010). EXTRACTING MAIN CONTENT-BLOCKS FROM BLOG POSTS. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 438-443. DOI: 10.5220/0003091804380443


in Bibtex Style

@conference{kdir10,
author={Saiful Akbar and Laura Slaughter and Øystein Nytrø},
title={EXTRACTING MAIN CONTENT-BLOCKS FROM BLOG POSTS},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={438-443},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003091804380443},
isbn={978-989-8425-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - EXTRACTING MAIN CONTENT-BLOCKS FROM BLOG POSTS
SN - 978-989-8425-28-7
AU - Akbar S.
AU - Slaughter L.
AU - Nytrø Ø.
PY - 2010
SP - 438
EP - 443
DO - 10.5220/0003091804380443