behaviour in general; in Section 3 we define some
specific examples from Google Trends; in Section 4
we explain the factors we have derived to represent
the data; in Section 5 we present the data modelling
techniques used; in Section 6 we explain the
experimental setup and present the classifier and
clustering results; finally, in Section 7 we draw
some conclusions about the current work.
2 RELATED WORK
A recent book (Surowiecki, 2005) called “The
Wisdom of Crowds” has become “cult” reading for
Internet marketers. One of its hypotheses is that
sometimes, the majority “know best”. In (Klienberg,
2002), on the other hand, it is proposed that
user access patterns remain stable until a given event or
press release in the media (information stream)
provokes a “burst” of interest in a particular topic,
web site or document. (Aizen, 2003) introduces the
concept of “batting average”, which implies that the
proportion of visits that lead to acquisitions is a
measure of an item's popularity. The authors
illustrate its usefulness as a complement to
traditional measures such as visit or acquisition
counts alone. Their study is in the context of movie
and audio downloads in e-commerce sites. They use
a stochastic modelling framework of hidden Markov
models (HMMs) (Rabiner, 1989) to explicitly
represent the underlying download probability as a
“hidden state” in the process, and to identify the
moments when this state changes. Access to specific
documents is also modelled as a function of time.
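
As an illustration of the “batting average” idea described above, the following minimal Python sketch computes the proportion of visits that lead to an acquisition and ranks items by it. The function name, the item names and the counts are hypothetical and are not taken from (Aizen, 2003); the sketch covers only the ratio itself, not the HMM-based modelling of how it changes over time.

```python
def batting_average(visits, acquisitions):
    """Proportion of visits to an item that led to an acquisition."""
    if visits < 0 or acquisitions < 0 or acquisitions > visits:
        raise ValueError("expected 0 <= acquisitions <= visits")
    # An item with no visits has no evidence either way; report 0.0.
    return acquisitions / visits if visits else 0.0


# Hypothetical (visits, acquisitions) counts for two items.
items = {"item_a": (1000, 50), "item_b": (120, 30)}

# Rank items by batting average rather than by raw visit counts.
ranked = sorted(items, key=lambda name: batting_average(*items[name]), reverse=True)
```

In this made-up example, item_a receives far more visits, but item_b ranks first because a larger share of its visits end in an acquisition, which is the complementary view that the measure is intended to give.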
The study of Internet search queries with a high
frequency is often motivated by the need for the
caching of results by search engine providers. One
such study is that of (Silvestri, 2004), in which hit-
ratio statistics were calculated for different
replacement policies and 'prefetching', to identify an
optimum policy.
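
To make the notion of a hit ratio concrete, the sketch below replays a stream of queries against a small result cache and reports the fraction of queries answered from the cache. It assumes a least-recently-used (LRU) replacement policy purely as an example; the policies actually compared in (Silvestri, 2004) are not detailed here, and the function name and query stream are hypothetical.

```python
from collections import OrderedDict


def lru_hit_ratio(queries, capacity):
    """Fraction of queries served from an LRU cache of the given capacity."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for query in queries:
        if query in cache:
            hits += 1
            cache.move_to_end(query)  # mark as most recently used
        else:
            cache[query] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(queries) if queries else 0.0


# Hypothetical query stream, replayed with two different cache sizes.
stream = ["chile", "cars", "chile", "radios", "cars", "chile"]
small = lru_hit_ratio(stream, capacity=2)
large = lru_hit_ratio(stream, capacity=3)
```

Comparing the ratios obtained for different policies or cache sizes on the same query log is the kind of evaluation that allows one policy to be preferred over another.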
On the other hand, several studies have sought to
identify the most popular terms used in queries to
search engines. The
frequencies are typically summed for a given period
of time. In (Baeza, 2005) the top 10 query terms and
queries were identified for the Chilean TodoCL
search engine. The top 10 queries for this search
engine were cited as being, in descending order of
frequency: {chile, sale, cars, santiago, radios, used,
house, photos, rent, Chilean}. Although this is useful
for the search engine provider to know, it does not
tell us about any trends or changes over time.
(Cacheda, 2001) made a similar study of the BIWE
search engine, and gave the most popular search
terms, such as the following: {sex, free, photos,
mp3, chat, famous} in descending order of
frequency.
Other studies have focussed on the most popular
documents returned by a query, rather than on the
popularity of the query terms themselves, although
the two are clearly inter-related.
In (Cho, 2000) it was proposed that the
evolution of the popularity of content pages in
Google search results was influenced by the PageRank
algorithm itself. It was stated that this is
because most of the user traffic is directed to popular
pages under a search-engine dominant model. The
plot of popularity against time showed a relatively
long period of low frequency followed by an abrupt
increase to a maximum value.
With reference to the application of different
unsupervised learning methods to web log analysis,
in (Nettleton and Baeza, 2008) a web log from the
TodoCL search engine was processed by different
clustering techniques (Fuzzy c-Means, Kohonen and
k-Means). A consensus operator was used to choose
the best predicted category/cluster by majority vote.
The objective was to group similar queries, based on
query search characteristics such as hold times,
ranking of the results clicked, and so on.
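
A majority-vote consensus of the kind mentioned above can be sketched as follows. This is an illustrative simplification, not the operator used in (Nettleton and Baeza, 2008): it assumes that the labels produced by the different clustering techniques have already been mapped to a common set of categories, and the function name and the tie-handling rule (returning None) are assumptions made for the example.

```python
from collections import Counter


def consensus_label(labels):
    """Return the label chosen by most techniques, or None if there is no winner."""
    ranked = Counter(labels).most_common()
    if not ranked:
        return None  # no votes at all
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequent labels
    return ranked[0][0]


# Hypothetical labels assigned to one query by three clustering techniques.
votes = {"fuzzy_c_means": "A", "kohonen": "A", "k_means": "B"}
chosen = consensus_label(list(votes.values()))
```

With three techniques, a query is assigned a category whenever at least two of them agree; queries on which all three disagree are left unassigned in this sketch.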
The identification of trends on the Internet, such as the
top 10 queries by frequency, calculated on a weekly
basis, or the most popular web pages, is already an
area of great interest for the search engine
companies, due to commercial interests. 'Google
Trends' (Google Trends, 2010) is an application
useful for observing trends in the frequency of
user query keywords over time. Recently, Google
has made the raw data available to users, which can
be downloaded in an Excel file. This can be
subsequently processed by different statistical
techniques. In the literature, there are several recent
examples of academic studies using 'Google
Trends'. For example, (Choi and Varian, 2010) try
to predict trend curves for the automotive industry
using statistical regression models. Also
(Choi and Varian, 2009) apply similar techniques to
the prediction of initial claims for unemployment
benefits in the U.S. welfare system. In (Rech, 2007),
'Google Trends' is used for knowledge discovery
with an application to software engineering.