6 CONCLUSIONS AND FUTURE WORK
As discussed above, lexical databases are costly, hand-made resources that, however, lack common everyday knowledge such as jargon, slang and frequent typos. Nevertheless, such terms, because of their pervasive presence in user Web search queries, are extremely important for improving the performance of search engines. This fact drove us to investigate the feasibility of automatically extracting term taxonomies from those very same queries. Throughout this paper we have described an approach with encouraging preliminary results. In fact, it seems not only that such results can be achieved using query logs alone, but also that the same should be attainable in different languages. Therefore, the stated research questions seem to have a feasible answer.
This research also has limitations that should be addressed in the near future. First, a much more precise way to identify specialization patterns is needed. Second, false positives (i.e. incorrectly flagged hyponymy relations) should be filtered out. And third, an evaluation framework should be devised in order to quantify the performance of the method. With regard to the first issue, we have already pointed out that, at this moment, only lexical clues are employed to detect specialization; we plan to reproduce the work by Boldi et al. (2009), who describe a machine learning method to detect much subtler specializations (e.g. labrador and dog). Regarding the second issue, we have explored a naïve heuristic based on the
position where modifiers occur in relation to the hypernym (i.e. whether they are pre- or post-modifiers). As we pointed out before, in the English language such modifiers tend to precede the hypernym (e.g. tropical fish, blue fish, recently caught fish) and, hence, it could be rather simple to remove most of the false positives. This could work in other languages but, certainly, it would not be language independent. However, statistical methods could perhaps be applied to these trivial specializations to discover the most common position of modifiers and adapt the application of the heuristic accordingly. Finally, with regard to the third issue, the need for an evaluation framework, we will probably start by relying on WordNet, although we have already pointed out the lack of specialized knowledge and slang in that database. Moreover, we believe that many pairs would be, in fact, instances and not hyponyms (e.g. angelina jolie and celebrity), which could be really difficult to evaluate by simply using WordNet. Hopefully, in future work we will be able to shed light on such issues.
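The modifier-position heuristic mentioned above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the whitespace tokenization, and the reading that pairs whose modifiers fall in the non-dominant position are the ones to discard are all assumptions made for illustration. The pair (fish tank, fish) is a hypothetical false positive that does not come from our data.

```python
from collections import Counter


def modifier_position(query, hypernym):
    """Return 'pre' if the modifiers precede the hypernym in the query,
    'post' if they follow it, and None if the query is not a lexical
    specialization of the hypernym. Tokenization is plain whitespace
    splitting, which is an assumption of this sketch."""
    q = query.lower().split()
    h = hypernym.lower().split()
    if len(q) <= len(h):
        return None  # no room for a modifier
    if q[-len(h):] == h:
        return "pre"   # e.g. "tropical fish" -> modifier before "fish"
    if q[:len(h)] == h:
        return "post"  # e.g. "fish tank" -> modifier after "fish"
    return None


def dominant_position(pairs):
    """Estimate the most common modifier position from a list of
    (query, hypernym) candidate pairs."""
    positions = [modifier_position(q, h) for q, h in pairs]
    counts = Counter(p for p in positions if p is not None)
    return counts.most_common(1)[0][0] if counts else None


def filter_pairs(pairs):
    """Keep only the pairs whose modifiers occur in the dominant position
    (assumed reading of the heuristic; the rest are treated as likely
    false positives)."""
    dominant = dominant_position(pairs)
    return [(q, h) for q, h in pairs if modifier_position(q, h) == dominant]


candidates = [
    ("tropical fish", "fish"),
    ("blue fish", "fish"),
    ("recently caught fish", "fish"),
    ("fish tank", "fish"),  # hypothetical false positive
]
kept = filter_pairs(candidates)
```

Because the dominant position is estimated from the candidate pairs themselves rather than hard-coded, the same sketch could in principle be run on query logs in a language where modifiers usually follow the hypernym, although that remains to be verified.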
REFERENCES
Baeza-Yates, R. and Tiberi, A. (2007). Extracting semantic relations from query logs. In KDD ’07: Proc. of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 76–85, New York, NY, USA. ACM.
Berland, M. and Charniak, E. (1999). Finding parts in very large corpora. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 57–64.
Boldi, P., Bonchi, F., Castillo, C., Donato, D., and Vigna, S. (2009). Query suggestions using query-flow graphs. In WSCD ’09: Proc. of the 2009 workshop on Web Search Click Data, pages 56–63, New York, NY, USA. ACM.
Broder, A. (2002). A taxonomy of web search. SIGIR Forum, 36(2):3–10.
Caraballo, S. A. (1999). Automatic construction of a hypernym-labeled noun hierarchy from text. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 120–126, Morristown, NJ, USA. Association for Computational Linguistics.
Chuang, S.-L. and Chien, L.-F. (2003). Enriching web taxonomies through subject categorization of query terms from search engine logs. Decis. Support Syst., 35(1):113–127.
Chuang, S.-L. and Chien, L.-F. (2004). A practical web-based approach to generating topic hierarchy for text segments. In CIKM ’04: Proc. of the thirteenth ACM international conference on Information and knowledge management, pages 127–136, New York, NY, USA. ACM.
Chuang, S.-L. and Chien, L.-F. (2005). Taxonomy generation for text segments: A practical web-based approach. ACM Trans. Inf. Syst., 23(4):363–396.
Clough, P., Joho, H., and Sanderson, M. (2005). Automatically organising images using concept hierarchies. In Proc. of the SIGIR Workshop on Multimedia Information Retrieval.
Fallows, D. (2008). Almost half of all internet users now use search engines on a typical day. Technical report, Pew Internet & American Life Project. Accessed 6 February 2009. Available at: http://www.pewinternet.org/pdfs//PIP Search Aug08.pdf.
Gabrilovich, E. and Markovitch, S. (2007). Harnessing the expertise of 70,000 human editors: Knowledge-based feature generation for text categorization. J. Mach. Learn. Res., 8:2297–2345.
Gayo-Avello, D. (2009). A survey on session detection methods in query logs and a proposal for future evaluation. Inf. Sci., 179(12):1822–1843.
Girju, R., Badulescu, A., and Moldovan, D. (2003). Learning semantic constraints for the automatic discovery of part-whole relations. In NAACL ’03: Proc. of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 1–8, Morristown, NJ, USA. Association for Computational Linguistics.
KDIR 2009 - International Conference on Knowledge Discovery and Information Retrieval