
query term in a document. The rationale behind the 
use of tf is that the more occurrences the query term 
has in a given document the more likely it is that the 
document is relevant to the input query. If synonymy 
is taken into account by summing up the term fre-
quencies of synonyms in a document, more accurate 
relevance scores are achieved in comparison to a 
conventional approach where synonymy is not taken 
into account. This can be illustrated using a simple 
example. Consider a page containing the following 
phrases with each having one occurrence on the 
page: sea level rise, rising sea level, and rising sea 
levels. In the conventional situation, where the user 
enters only the base form of the query term (i.e., sea level rise) 
and the term frequencies of synonyms are not 
summed up, tf=1. When the term frequencies 
are summed up, tf=3, which is more realistic because 
the concept denoted by the three phrases indeed 
appears three times on the page. Authors (of 
Web pages) tend to use alternative phrases and do 
not only use base form terms but also different syn-
tactic and morphological forms. This means that 
many important documents are ranked lower than 
they actually deserve if synonymy is not taken into 
account. 
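The summation in the sea level rise example can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (phrase_tf, summed_tf) and the sample page text are assumptions made for this example and are not part of the RT-dictionary system. Phrases are matched on word boundaries so that an occurrence of rising sea levels is not also counted as rising sea level.

```python
import re


def phrase_tf(text, phrase):
    """Count whole-phrase, case-insensitive occurrences of `phrase` in `text`."""
    # Word boundaries keep "rising sea level" from matching inside
    # "rising sea levels".
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def summed_tf(text, variants):
    """Sum the term frequencies of all synonymous variants of one concept."""
    return sum(phrase_tf(text, variant) for variant in variants)


# Hypothetical page in which each of the three phrases occurs exactly once.
page = (
    "Sea level rise threatens coastal cities. Rising sea level is already "
    "measurable, and rising sea levels are projected to continue."
)
variants = ["sea level rise", "rising sea level", "rising sea levels"]

conventional_tf = phrase_tf(page, "sea level rise")  # base form only
synonym_tf = summed_tf(page, variants)               # all variants summed

print(conventional_tf, synonym_tf)
```

With this sample page the base form alone yields tf=1, whereas summing over the three variants yields tf=3, matching the figures in the example above.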
8 CONCLUSIONS 
We described a method to construct a topic-specific 
dictionary of real-text phrases to support query for-
mulation in Web searching, and presented the exist-
ing climate change RT-dictionary. The proposed 
method is a general method and can be applied to 
any reasonable topic. We argued that there is a need 
for such search assistance because queries are difficult 
to formulate, in particular for complex information 
needs. In the experimental part of this paper we 
showed that the proposed importance score is a good 
indicator of search success. 
Further development of the RT-dictionary was 
discussed in the previous sections. We also plan 
to construct RT-dictionaries for new topics and to 
add multilingual features to the RT-dictionary. 
ACKNOWLEDGEMENTS  
This study was funded by the Academy of Finland 
(research projects 130760, 218289). 
 
REFERENCES 
Belkin, N. J., Oddy, R. N. and Brooks, H. M., 1982. ASK for 
information retrieval: Part I. Background and theory. 
Journal of Documentation, 38(2), 61-71. 
Bergmark, D., Lagoze, C. and Sbityakov, A., 2002. Fo-
cused crawls, tunneling, and digital libraries.  Proc. of 
the 6th European Conference on Research and Ad-
vanced Technology for Digital Libraries, Rome, Italy, 
September 16-18, pp. 91-106. 
Chakrabarti, S., van den Berg, M. and Dom, B., 1999. 
Focused crawling: a new approach to topic-specific 
Web resource discovery. Proc. of the Eighth Interna-
tional World Wide Web Conference, Toronto, Canada, 
May 11-14, pp. 1623-1640. 
Cronen-Townsend, S., Zhou, Y. and Croft, B., 2002. 
Predicting query performance. Proc. of the 25th ACM 
SIGIR Conference on Research and Development in 
Information Retrieval, Tampere, Finland, August 11-
15, pp. 299-306. 
Diligenti, M., Coetzee, F. M., Lawrence, S., Giles, C.L. 
and Gori, M., 2000. Focused crawling using context 
graphs. Proc. of the 26th International Conference on 
Very Large Databases (VLDB), Cairo, Egypt, Septem-
ber 10-14, pp. 527-534. 
El-Beltagy, S. and Rafea, A., 2009. KP-Miner: A key-
phrase extraction system for English and Arabic 
documents. Information Systems, 34(1), 132-144. 
He, B. and Ounis, I., 2006. Query performance prediction. 
Information Systems, 31(7), 585-594. 
Ingwersen, P. and Järvelin, K., 2005. The Turn: Integra-
tion of Information Seeking and Retrieval in Context. 
Heidelberg, Springer. 
Jaene, H. and Seelbach, D., 1975. Maschinelle Extraktion 
von zusammengesetzten Ausdrücken aus englischen 
Fachtexten. Report ZMD-A-29. Beuth Verlag, Berlin. 
Jansen, B. J., Spink, A. and Saracevic, T., 2000. Real life, 
real users, and real needs: A study and analysis of user 
queries on the Web. Information Processing & Man-
agement, 36(2), 207-227. 
Lee, H. J., 2008. Mediated information retrieval in Web 
searching. Proc. of the American Society for Information 
Science and Technology, 45(1), 1-10. 
Muresan, G. and Harper, D. J., 2004. Topic modeling for 
mediated access to very large document collections. 
Journal of the American Society for Information Science 
and Technology, 55(10), 892-910. 
Perez-Iglesias, J. and Araujo, L., 2010. Standard deviation 
as a query hardness estimator. The 17th International 
Symposium on String Processing and Information Re-
trieval (SPIRE 2010), Los Cabos, Mexico, October 
11-13, pp. 207-212. 
Pirkola, A., 2011a. Constructing topic-specific search 
keyphrase suggestion tools for Web information re-
trieval. Proc. of the 12th International Symposium on 
Information Science (ISI 2011), Hildesheim, Germany, 
March 9-11,  pp. 172-183. 
Pirkola, A., 2011b. A Web search system focused on 
climate change. Digital Proceedings, Earth Observa-