the method of analysing search behaviour and
information needs and on the patterns that can be
observed when examining search queries over time.
Studying table 2, one can come to the conclusion
that jupiter is a rather common query. This is only
partly true; jupiter is indeed a frequently used term
but not a frequently used query. In fact, “jupiter” as
a stand-alone term occurs in only 34 of the 2,782
queries that include the term jupiter. In 98.78% of
the jupiter-related queries, the term jupiter is
combined with other terms, which can be seen also
from tables 3 and 4. The term jupiter thus does not
represent the information need; this can instead be
found in the other part of the pair (such as in “jupiter
lift”) or triplet (such as in “jupiter golf
competition”). So although table 2 is correct in a
statistical sense, such a listing of individual terms may
skew the understanding of the search behaviour.
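The difference between term frequency and query frequency can be illustrated with a minimal sketch. The query log below is invented for illustration and is not the Jupiter data; the function names `count_units` and `standalone_share` are hypothetical, and the sketch assumes that queries are lower-cased, split on whitespace, and that pairs and triplets are adjacent terms within a query, which may differ from the definitions used in the study.

```python
from collections import Counter


def count_units(queries):
    """Count full queries, terms, adjacent pairs and adjacent triplets.

    Assumption: queries are lower-cased and split on whitespace.
    """
    query_counts = Counter()
    term_counts = Counter()
    pair_counts = Counter()
    triplet_counts = Counter()
    for raw in queries:
        terms = raw.lower().split()
        if not terms:
            continue  # skip empty queries
        query_counts[" ".join(terms)] += 1
        term_counts.update(terms)
        pair_counts.update(zip(terms, terms[1:]))
        triplet_counts.update(zip(terms, terms[1:], terms[2:]))
    return query_counts, term_counts, pair_counts, triplet_counts


def standalone_share(queries, term):
    """Share of the queries containing `term` that consist of `term` alone."""
    tokenised = [q.lower().split() for q in queries]
    containing = [terms for terms in tokenised if term in terms]
    if not containing:
        return 0.0
    alone = sum(1 for terms in containing if terms == [term])
    return alone / len(containing)


# Toy log, invented for illustration only.
log = [
    "jupiter lift",
    "jupiter lift",
    "jupiter golf competition",
    "jupiter",
    "travel expenses",
    "travel expenses",
    "travel expenses",
]
q_counts, t_counts, p_counts, tr_counts = count_units(log)
print(t_counts.most_common(1))   # most frequent term
print(q_counts.most_common(1))   # most frequent full query
print(standalone_share(log, "jupiter"))
```

In this toy log the most frequent term is not the most frequent query, and "jupiter" stands alone in only one of the four queries that contain it. For the figures reported above, 34 / 2,782 is roughly 1.22%, which is consistent with 98.78% of the jupiter-related queries combining the term with others.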
Term frequency lists are presented in much of the
published research in this area (cf. Jansen & Spink,
2005; Spink et al., 2001; Jansen et al., 2000), but we
argue it may be better to instead list the most
frequently submitted queries or to include the most
frequently used pairs and triples, as do Chau et al.
(2005). Only half of the most frequently used terms
overlapped with the most frequently submitted
queries. If we see differences between term
frequencies and query frequencies already on an
intranet where the average query length is 1.44 terms
and 69% of the queries are single-term queries
(Stenmark, 2005b; 2006), this difference would
probably be even more evident on the public web
where the average query length is closer to 2.5
terms. This further underlines the need to look
beyond mere query term analysis when trying to
understand the information needs of search engine
users.
As in Chau et al.’s (2005) study, our study shows
that the frequency of the highest-ranked term pair
is considerably lower than the frequency of the
highest-ranked term, and that the frequency of the most
sought-for triplet is lower still. We also note that the drop
is much more pronounced in our data than in Chau
et al.’s study. In addition, the slope of the Zipf plots
in figure 1 is not as steep as theory would have it.
These observations suggest that a larger portion of
single-term queries is used at Jupiter. Referring to
Fagin et al. (2003), we suggest that this is because
intranets contain more jargon and more acronyms
than does the public web. Another possible
explanation suggested by Stenmark (2005b; 2006) is
the presence of Swedish terms. The Swedish
language makes use of compound words, resulting
in single terms where e.g. English would have used
two terms.
We were expecting there would be more unique
search terms on a general-purpose search engine
than on a site-specific one, but Jansen et al.’s (2000)
slope of -0.975 for Excite terms is very close to
Chau and colleagues’ slope of -0.9533 for the Utah
search engine. A single web site can be expected to
be narrower in coverage and thus have a more
limited vocabulary, and we were expecting this to
show in the distribution of search words. We had
originally been expecting the Zipf plot of an intranet
search engine to fall somewhere in between the Utah
and the Excite plots, but our slopes of around
-0.85 are less steep than both of the others. We posit that
the Swedish way of constructing compound words
makes the number of terms grow quicker than the
frequency, hence producing these results. Additional
(linguistic) analysis is required to fully understand
this issue. It would be interesting to compare our
findings to those from other intranets using other
languages, say Finnish or English, to try to establish
what is intranet dependent and what depends on
the language.
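How a Zipf slope of this kind can be estimated is sketched below. The sketch assumes an ordinary least-squares fit of log10(frequency) against log10(rank); the fitting procedure actually used for figure 1 is not stated here, so this is an illustration, not a reconstruction of it. The function name `zipf_slope` is hypothetical and the input frequencies are synthetic.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log10(frequency) against log10(rank).

    `frequencies` is an iterable of positive counts; rank 1 is assigned
    to the most frequent item.
    """
    ordered = sorted(frequencies, reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two items to fit a slope")
    xs = [math.log10(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log10(freq) for freq in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic data: frequency proportional to 1 / rank (the classic Zipf case)
# and to 1 / rank**0.85 (a flatter curve).
ideal = [1000 / rank for rank in range(1, 201)]
flatter = [1000 / rank ** 0.85 for rank in range(1, 201)]
print(round(zipf_slope(ideal), 3))
print(round(zipf_slope(flatter), 3))
```

A slope closer to zero, as in the second synthetic series, means that frequency falls off more slowly with rank, i.e. occurrences are spread over a larger number of distinct terms.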
As was evident from table 1, the top terms'
portions of the total are fairly consistent over the
years, i.e. a relatively small subset of the terms is
used again and again. The portion of hapaxes (i.e.,
words that are not repeated) is not equally stable, although
the variances are rather small. Close to 60% of the
query terms are used only once, but since the
repeated words are sometimes used very frequently,
the hapaxes only make up some 15-19% of the total
corpus. Still, 15-19% is a significant portion and it
indicates that the information need is focused on
quite a narrow field. When studying the top-100
terms, we noted that although more than half of the
terms were present only in one year, some 17% of
the terms reappeared every year. This distribution
holds also for the top-10 terms. The corresponding
numbers for the top-100 queries are similar; some
12% of the queries are found across all years.
Apparently, there are things that the Jupiter
employees continue to search for year after year,
indicating what we take to be a long-term information
need. Information about such needs would be useful
to information providers and site designers within
the organisation. Chau et al. (2005) argue that such
frequently sought-for information should be made
accessible via prominently placed links.
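The two measures discussed in this paragraph, the share of hapaxes and the persistence of top items across years, can be sketched as follows. The counts and year labels are invented for illustration, the function names `hapax_shares` and `persistence` are hypothetical, and the persistence shares are computed relative to the union of all years' top lists, which is an assumption about the base of the percentages.

```python
from collections import Counter


def hapax_shares(term_counts):
    """Return (hapaxes / distinct terms, hapaxes / all term occurrences)."""
    distinct = len(term_counts)
    if distinct == 0:
        return 0.0, 0.0
    hapaxes = sum(1 for count in term_counts.values() if count == 1)
    total = sum(term_counts.values())
    return hapaxes / distinct, hapaxes / total


def persistence(top_by_year):
    """Return (share present in every year, share present in one year only).

    Assumption: shares are relative to the union of all years' top lists.
    """
    year_sets = [set(items) for items in top_by_year.values()]
    union = set().union(*year_sets)
    if not union:
        return 0.0, 0.0
    every_year = set.intersection(*year_sets)
    appearances = Counter(item for s in year_sets for item in s)
    single_year = [item for item, n in appearances.items() if n == 1]
    return len(every_year) / len(union), len(single_year) / len(union)


# Toy term counts, invented for illustration only.
counts = Counter({"lift": 10, "golf": 5, "alpha": 1, "beta": 1, "gamma": 1})
share_of_distinct, share_of_total = hapax_shares(counts)
print(share_of_distinct, round(share_of_total, 3))

# Toy top lists with hypothetical year labels.
tops = {
    "year 1": ["lift", "phone", "golf"],
    "year 2": ["lift", "phone", "travel"],
    "year 3": ["lift", "phone", "salary"],
}
print(persistence(tops))
```

In the toy counts, three of the five distinct terms are hapaxes (60%), yet they account for only 3 of the 18 term occurrences, which illustrates how a majority of once-used terms can still make up a modest part of the total corpus.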
However, we see that the portions of terms and
queries not repeated are bigger, and we posit that the
large portion of unique terms and unique queries
indicates that there is a shift in information seeking
behaviour from year to year. These queries may
indicate the short-term information needs. These
needs may further be seasonal, as suggested by
Chau et al. (2005). It seems plausible that
information about the Jupiter golf competition will
be more attractive closer to the actual event. The
shift in information needs that these data suggest may