ACKNOWLEDGEMENTS
This work was supported in part by the NSF under
grant CNS-1331632.
REFERENCES
Adelberg, B. (1998). NoDoSE—a tool for semi-automatically
extracting structured and semistructured data from text
documents. In Proceedings of the 1998 ACM SIGMOD
International Conference on Management of Data, SIGMOD ’98,
pages 283–294, New York, NY, USA. ACM.
Baluja, S. (2006). Browsing on small screens: Recasting
web-page segmentation into an efficient machine learning
framework. In Proceedings of the 15th International
Conference on World Wide Web, WWW ’06, pages 33–42,
New York, NY, USA. ACM.
Bar-Yossef, Z. and Rajagopalan, S. (2002). Template
detection via data mining and its applications. In
Proceedings of the 11th International Conference on World
Wide Web, WWW ’02, pages 580–591, New York, NY, USA. ACM.
Chakrabarti, D., Kumar, R., and Punera, K. (2007).
Page-level template detection via isotonic smoothing. In
Proceedings of the 16th International Conference on
World Wide Web, WWW ’07, pages 61–70, New York,
NY, USA. ACM.
Crescenzi, V., Mecca, G., and Merialdo, P. (2001).
RoadRunner: Towards automatic data extraction from large
web sites. In Proceedings of the 27th International
Conference on Very Large Data Bases, VLDB ’01,
pages 109–118, San Francisco, CA, USA. Morgan
Kaufmann Publishers Inc.
Debnath, S., Mitra, P., Pal, N., and Giles, C. L. (2005).
Automatic identification of informative sections of
web pages. IEEE Trans. on Knowl. and Data Eng.,
17(9):1233–1246.
Gibson, D., Punera, K., and Tomkins, A. (2005). The
volume and evolution of web page templates. In Special
Interest Tracks and Posters of the 14th International
Conference on World Wide Web, WWW ’05, pages
830–839, New York, NY, USA. ACM.
Gibson, J., Wellner, B., and Lubar, S. (2007). Adaptive
web-page content identification. In Proceedings of the
9th Annual ACM International Workshop on Web
Information and Data Management, WIDM ’07, pages
105–112, New York, NY, USA. ACM.
Guo, Y., Tang, H., Song, L., Wang, Y., and Ding, G. (2010).
ECON: An approach to extract content from web news
page. In Web Conference (APWEB), 2010 12th
International Asia-Pacific, pages 314–320.
Jia, M. and Wang, J. (2014). Handling big data of online
social networks on a small machine. Lecture Notes in
Computer Science (including subseries Lecture Notes
in Artificial Intelligence and Lecture Notes in
Bioinformatics), 8591 LNCS:676–685.
Joshi, P. M. and Liu, S. (2009). Web document text and
images extraction using DOM analysis and natural
language processing. In Proceedings of the 9th ACM
Symposium on Document Engineering, DocEng ’09,
pages 218–221, New York, NY, USA. ACM.
Kao, H.-Y., Chen, M.-S., Lin, S.-H., and Ho, J.-M. (2002).
Entropy-based link analysis for mining web informative
structures. In Proceedings of the Eleventh International
Conference on Information and Knowledge
Management, CIKM ’02, pages 574–581, New York,
NY, USA. ACM.
Kao, H.-Y., Lin, S.-H., Ho, J.-M., and Chen, M.-S. (2004).
Mining web informative structures and contents based
on entropy analysis. IEEE Trans. on Knowl. and Data
Eng., 16(1):41–55.
Kohlschütter, C., Fankhauser, P., and Nejdl, W. (2010).
Boilerplate detection using shallow text features. In
Proceedings of the Third ACM International Conference
on Web Search and Data Mining, WSDM ’10,
pages 441–450, New York, NY, USA. ACM.
Liu, L., Pu, C., and Han, W. (2000). XWRAP: An
XML-enabled wrapper construction system for web
information sources. In Data Engineering, 2000.
Proceedings. 16th International Conference on,
pages 611–621.
Pasternack, J. and Roth, D. (2009). Extracting article text
from the web with maximum subsequence segmentation.
In WWW.
Prasad, J. and Paepcke, A. (2008). CoreEx: Content
extraction from online news articles. In Proceedings of the
17th ACM Conference on Information and Knowledge
Management, CIKM ’08, pages 1391–1392, New
York, NY, USA. ACM.
Uzun, E., Agun, H. V., and Yerlikaya, T. (2013). A hybrid
approach for extracting informative content from web
pages. Inf. Process. Manage., 49(4):928–944.
Weninger, T. and Hsu, W. H. (2008). Text extraction from
the web via text-to-tag ratio. In Proceedings of the
2008 19th International Conference on Database and
Expert Systems Application, DEXA ’08, pages 23–28,
Washington, DC, USA. IEEE Computer Society.
Yi, L., Liu, B., and Li, X. (2003). Eliminating noisy
information in web pages for data mining. In Proceedings
of the Ninth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, KDD ’03,
pages 296–305, New York, NY, USA. ACM.
Ziegler, C.-N. and Skubacz, M. (2007). Content extraction
from news pages using particle swarm optimization
on linguistic and structural features. In Proceedings
of the IEEE/WIC/ACM International Conference on
Web Intelligence, WI ’07, pages 242–249,
Washington, DC, USA. IEEE Computer Society.