query term in a document. The rationale behind the
use of tf is that the more occurrences the query term
has in a given document the more likely it is that the
document is relevant to the input query. If synonymy
is taken into account by summing up the term fre-
quencies of synonyms in a document, more accurate
relevance scores are achieved in comparison to a
conventional approach where synonymy is not taken
into account. This can be illustrated using a simple
example. Consider a page containing the following
phrases, each occurring once on the
page: sea level rise, rising sea level, and rising sea
levels. In the conventional situation, where the user
enters only the base-form query term (i.e., sea level rise)
and the term frequencies of synonyms are not
summed, tf=1. When the term frequencies
are summed, tf=3, which is more realistic because
the concept denoted by the three phrases indeed
appears three times on the page. Authors (of
Web pages) tend to use alternative phrases and do
not only use base form terms but also different syn-
tactic and morphological forms. This means that
many important documents are ranked lower than
they actually deserve if synonymy is not taken into
account.
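The sea level rise example can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the scoring used by any particular search engine: the function names term_frequency and summed_tf, the whole-word matching rule, and the sample page text are all hypothetical and introduced here for the example.

```python
import re


def term_frequency(text, phrase):
    """Count case-insensitive, whole-word occurrences of a phrase in text."""
    # Word boundaries keep "rising sea level" from matching inside
    # "rising sea levels", so each variant is counted separately.
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def summed_tf(text, variants):
    """Sum the term frequencies of all synonymous variants of one concept."""
    return sum(term_frequency(text, variant) for variant in variants)


# Hypothetical page text in which each variant occurs exactly once.
page = (
    "Sea level rise threatens coastal cities. "
    "Rising sea level is measured by satellites. "
    "Rising sea levels affect small islands."
)

# Base form first, followed by its syntactic and morphological variants.
variants = ["sea level rise", "rising sea level", "rising sea levels"]

conventional_tf = term_frequency(page, variants[0])  # base form only
synonym_tf = summed_tf(page, variants)               # all variants summed

print("conventional tf:", conventional_tf)
print("summed tf:", synonym_tf)
```

With this sample page the base form alone is counted once, while summing over the three variants counts the concept three times, which matches the tf=1 and tf=3 values discussed above.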
8 CONCLUSIONS
We described a method to construct a topic-specific
dictionary of real-text phrases to support query formulation
in Web searching, and presented the existing
climate change RT-dictionary. The proposed
method is general and can be applied to
any reasonable topic. We argued that such search
assistance is needed because queries are difficult to
formulate, in particular for complex information
needs. In the experimental part of this paper we
showed that the proposed importance score is a good
indicator of search success.
Further development of the RT-dictionary was
discussed in the previous sections. We also plan
to construct RT-dictionaries for new topics and to
add multilingual features to the RT-dictionary.
ACKNOWLEDGEMENTS
This study was funded by the Academy of Finland
(research projects 130760, 218289).