Table 5: Comparative results for title extraction methods.
Method        Average similarity
Baseline      0.62
TitleFinder   0.52
TTA           0.84
Figure 5: Similarities of the detected titles with the ground
truth titles.
4 CONCLUSIONS
In this paper, we proposed a fully automated method for
extracting titles from web pages without requiring
extensive training data or user interaction. The proposed
method analyses the content of the title and meta tags
and extracts a suitable substring to represent the
content of the web page. The extracted title should be
short but informative so that it can be used on both web
and mobile devices. The method, Title Tag Analyzer (TTA),
is integrated with Mopsi search to summarize the
retrieved web pages.
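To make the input of such an analysis concrete, the following is a minimal sketch of how the title and meta tag contents of a page could be collected using only the Python standard library. It is an illustration under our own assumptions, not the TTA implementation: the class `TitleMetaCollector` and the function `collect_title_and_meta` are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class TitleMetaCollector(HTMLParser):
    """Collects the <title> text and the content of named <meta> tags.

    Hypothetical helper for illustration; not part of TTA.
    """

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.meta = {}  # lowercased meta name (or property) -> content

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = attr.get("name") or attr.get("property")
            content = attr.get("content")
            if name and content:
                self.meta[name.lower()] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is accumulated only while inside the <title> element.
        if self._in_title:
            self.title += data


def collect_title_and_meta(html):
    """Return (title text, dict of meta contents) for an HTML string."""
    parser = TitleMetaCollector()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.meta
```

For example, `collect_title_and_meta(page_source)` returns the raw title string together with a dictionary such as `{"description": ...}`; any filtering of irrelevant text would be applied to these raw strings afterwards.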
We conducted various experiments to evaluate the
performance of TTA. Our findings are as follows:
- The proposed method significantly outperforms the
  baseline, improving the average similarity from 0.62
  to 0.84.
- Title and meta tags usually contain the correct
  title, but they also contain irrelevant text that
  needs to be filtered out.
- The words in the web page link have the highest
  impact on selecting the correct title for the page.
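The last two findings can be illustrated with a small sketch: the title tag content is split into segments at common separators, and the segment whose words occur most often in the page link is selected. This is a simplified stand-in under our own assumptions, not the scoring used by TTA; the separator set, the minimum word length of three characters, and the names `link_score` and `pick_title` are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed separators between title segments: "|", "»", and " - ".
SEPARATORS = re.compile(r"\s*[|»]\s*|\s+-\s+")


def link_score(segment, url):
    """Fraction of the segment's words that occur in the URL host or path."""
    parsed = urlparse(url)
    haystack = (parsed.netloc + parsed.path).lower()
    # Very short words are ignored because they match almost any URL.
    words = [w for w in re.findall(r"[a-z0-9]+", segment.lower()) if len(w) >= 3]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in haystack)
    return hits / len(words)


def pick_title(title_text, url):
    """Return the title segment best supported by the words in the link."""
    segments = [s.strip() for s in SEPARATORS.split(title_text) if s.strip()]
    if not segments:
        return ""
    # max() keeps the first segment when several share the best score.
    return max(segments, key=lambda s: link_score(s, url))
```

With the title text `"Home | Pizza Palace | Best pizza in town"` and the link `"http://www.pizzapalace.com/index.html"`, the segment `"Pizza Palace"` is selected, because both of its words occur in the host name, while the other segments match partially or not at all.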
ACKNOWLEDGEMENTS
The work described in this paper was supported by the
MOPSI project, University of Eastern Finland.