example, the terms “speaks” and “speaking”,
resulting from a combination of a sole root with two
different suffixes (“s” and “ing”), are brought back
to the same lemma “speak”.
With the aid of a dictionary we calculated that
4,135 lower case queries (88.54%) were not in
lemmatised form (Table 3). The percentage is lower
for upper case queries (31.03%), as most of these
terms are abbreviations or names of persons and
organizations. Subtle differences between queries (e.g.
“Πανεπιστήμιο Αθήνας”, “Πανεπιστήμιο Αθηνών”,
both meaning University of Athens) change the pages
retrieved by Google, Yahoo and other international
search engines, and even by national search engines,
which supposedly have a better understanding of the
Greek language.
Table 3: Number of non-lemmatised queries.
Query type    Queries in non-lemmatised form    Percentage
Lower case    4,135                             88.54%
Upper case    319                               31.03%
Lemmatization would be quite helpful in Greek
Web searching, since most queries, and evidently
most Web pages, are not in lemmatised form, so a
query term cannot be matched against a page that
uses a different inflected form of the same word.
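The dictionary check described above can be sketched as follows. This is only a minimal illustration, assuming the lemma dictionary is available as a set of lower case strings; the names `LEMMAS`, `is_lemmatised` and `count_non_lemmatised`, as well as the tiny word list, are hypothetical and are not the dictionary or procedure used in the study.

```python
# Hypothetical toy lemma dictionary; a real one would hold many thousands of entries.
LEMMAS = {"speak", "university", "athens", "πανεπιστήμιο", "αθήνα"}


def is_lemmatised(query, lemmas=LEMMAS):
    """Return True if every whitespace-separated term of the query is a known lemma."""
    terms = query.lower().split()
    # An empty query has no terms, so it is not counted as lemmatised.
    return bool(terms) and all(term in lemmas for term in terms)


def count_non_lemmatised(queries, lemmas=LEMMAS):
    """Count the queries containing at least one term that is not a known lemma."""
    return sum(1 for query in queries if not is_lemmatised(query, lemmas))


if __name__ == "__main__":
    sample = ["speak", "speaks", "speaking", "Πανεπιστήμιο Αθηνών"]
    print(count_non_lemmatised(sample), "of", len(sample), "queries are not lemmatised")
```

Note that under this simplification a term missing from the dictionary (a proper name, for instance) is also counted as non-lemmatised, so the quality of the dictionary directly affects the count.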
2.2.5 Stopwords
Stopwords are terms that appear so frequently in
documents that their discriminatory value is low
(van Rijsbergen, 1979). Elimination of
stopwords is one of the first stages in typical
information retrieval systems. In English Web
searching, stopwords are either removed or do not
influence the retrieval process significantly.
Stopword lists have been constructed for most of the
major European languages (see
http://snowball.tartarus.org for example) and they
could be utilized by search engines. Such a list does
not exist for the Greek language. Usual candidates
for a stopword list are articles, prepositions and
conjunctions (Baeza-Yates & Ribeiro-Neto, 1999).
Using all 5,698 lower and upper case queries, we
identified the articles, prepositions and conjunctions
present in our query collection. Such common
words appear in 1,516 queries, that is, in 26.61% of
the queries. They occur 2,032 times within these
1,516 queries and thus account for 14.42% of the
total words of the Greek queries.
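The statistics above (queries containing a common word, occurrences of common words, and their share of all query words) could be computed along the following lines. This is a hedged sketch only: the `COMMON_WORDS` set is a deliberately tiny, hypothetical list of Greek articles, prepositions and conjunctions, and the function name `common_word_stats` is an assumption, not the procedure actually used for the query collection.

```python
# Hypothetical, deliberately tiny list of Greek articles, prepositions and conjunctions.
COMMON_WORDS = {"ο", "η", "το", "του", "της", "των", "σε", "από", "με", "για", "και"}


def common_word_stats(queries, common_words=COMMON_WORDS):
    """Return counts and percentages describing common-word usage in the queries."""
    queries_with_common = 0   # queries containing at least one common word
    common_occurrences = 0    # total occurrences of common words
    total_words = 0           # total number of words over all queries
    for query in queries:
        terms = query.lower().split()
        hits = sum(1 for term in terms if term in common_words)
        total_words += len(terms)
        common_occurrences += hits
        if hits:
            queries_with_common += 1
    return {
        "queries_with_common": queries_with_common,
        "common_occurrences": common_occurrences,
        "pct_queries": 100.0 * queries_with_common / len(queries) if queries else 0.0,
        "pct_words": 100.0 * common_occurrences / total_words if total_words else 0.0,
    }


if __name__ == "__main__":
    sample = ["ξενοδοχεία σε Αθήνα", "καιρός", "χάρτης της Ελλάδας και της Κύπρου"]
    print(common_word_stats(sample))
```

The reported query percentage follows the same arithmetic: 1,516 of 5,698 queries is 100 × 1,516 / 5,698 ≈ 26.61%.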
These statistics indicate that users do utilize
common words in their queries and therefore the
construction of a Greek stopword list and its
application to Web retrieval should be further
studied.
2.2.6 Other Issues
Although the analysis of the data is still in progress,
the most important issues were discussed above. A
number of other issues were also identified by
observing the user queries but they have not been
thoroughly examined as yet.
A number of queries in the English part
contained the string “www” or were in a semi-URL
form. For instance, a user typed the query “travel to
Greece.gr”. This is an indication that some users are
not competent in search engine usage. Proper
training, or the presentation of suitable examples on
the search engine’s page, could help users overcome
their misconceptions.
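Queries of this kind could be flagged automatically with a simple pattern. The sketch below is an assumed heuristic rather than the method used in the study: it treats a query as URL-like if it contains the standalone string “www” or a dot followed by a short suffix of two to four Latin letters. The names `URL_LIKE` and `looks_like_url` are hypothetical, and the suffix rule will also flag file names such as “report.pdf”.

```python
import re

# Assumed heuristic: standalone "www", or a dot followed by a 2-4 letter Latin suffix.
URL_LIKE = re.compile(r"\bwww\b|\.[a-z]{2,4}\b", re.IGNORECASE)


def looks_like_url(query):
    """Return True if the query contains "www" or a domain-like suffix."""
    return URL_LIKE.search(query) is not None


if __name__ == "__main__":
    for query in ["travel to Greece.gr", "www hotels", "hotels in Athens"]:
        print(repr(query), looks_like_url(query))
```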
By inspecting the first 100 queries of the sample
we located 3 spelling errors. We ran these queries in
Google and got either no results or pages with
the same spelling errors as in the query. International
search engines assist English users with spelling
errors through “Did you mean” tips. For instance,
Yahoo presents the message “Did you mean:
confidentiality” if a user types the word
“confidentiallity” in its search box.
In 12 Greek queries the “*” wildcard was used at
the end of the query. As is known, users get no
additional results if they use wildcards. Additionally,
the wildcard was not used properly, as a space was
included between the last word and the wildcard.
This observation, along with the inclusion of “www”
in the queries, is an indication that a few search
engine users are confused and therefore training is
needed.
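The wildcard observation can be made operational with a small classifier that separates queries ending in a wildcard attached to the last word from those where a space precedes it. This is an illustrative sketch only; the function name `trailing_wildcard` and its three labels are assumptions introduced here.

```python
def trailing_wildcard(query):
    """Classify the trailing "*" of a query as 'none', 'attached' or 'detached'."""
    stripped = query.rstrip()
    if not stripped.endswith("*"):
        return "none"
    before = stripped[:-1]
    # A wildcard preceded by whitespace (or by nothing) is not attached to a word.
    if before == "" or before[-1].isspace():
        return "detached"
    return "attached"


if __name__ == "__main__":
    for query in ["ξενοδοχ*", "ξενοδοχεία *", "ξενοδοχεία"]:
        print(repr(query), trailing_wildcard(query))
```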
“GreekEnglish” is a term shared among Greek
Internet users. It refers to the typing of Greek words
using English characters. For example, the word
“Athina” in GreekEnglish is the word “Αθήνα” in
Greek and “Athens” in English. GreekEnglish
originates from the time when Greek characters were
not supported in some operating systems or e-mail
clients, and it was invented as a means of
communication that ensured readability. Several users
still follow this practice. We observed several
instances of GreekEnglish queries in our sample.
However, it cannot be decided whether this was a
conscious choice or whether it results, again, from
user misconceptions about whether Greek characters
can be used in searching. Advanced options such as
site or file specification were sporadically detected.
However, we cannot derive valid conclusions from
this finding since queries are submitted to Google
WEBIST 2007 - International Conference on Web Information Systems and Technologies