Web search engine users employ uppercase letters in
their queries, their capitalization matches the most
frequent form on the Web rather randomly.
However, since word capitalization is highly
dependent on the context in which a word is used,
we must also employ higher order n-gram statistics.
For each token $w_i$ in a query
$q = w_1 \ldots w_{i-2} w_{i-1} w_i w_{i+1} w_{i+2} \ldots w_n$,
we examine the statistics for all possible bigrams
(left: $w_{i-1} w_i$ and right: $w_i w_{i+1}$) and
trigrams (left: $w_{i-2} w_{i-1} w_i$, middle:
$w_{i-1} w_i w_{i+1}$, and right: $w_i w_{i+1} w_{i+2}$)
that contain it (obviously, some of these are undefined
for values of $n \le 2$ or $i \in \{0, 1, n-1, n\}$, and
thus cannot be accounted for). 86.8% of the bigrams and
59.8% of the trigrams present in our queries appear in
the corresponding Google n-gram sets with at least one
capitalization.
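The enumeration of context n-grams described above can be sketched as follows. This is only an illustration: the function name `context_ngrams` and the window labels are hypothetical, and the code uses 0-based positions where the text uses 1-based indices.

```python
def context_ngrams(tokens, i):
    """Return the bigrams and trigrams that contain tokens[i] (0-based index).

    N-grams that would extend past either end of the query are omitted,
    which mirrors the "undefined" cases mentioned in the text.
    """
    n = len(tokens)
    # Each window is a half-open slice [start, end) over the query tokens.
    windows = {
        "left_bigram": (i - 1, i + 1),
        "right_bigram": (i, i + 2),
        "left_trigram": (i - 2, i + 1),
        "middle_trigram": (i - 1, i + 2),
        "right_trigram": (i, i + 3),
    }
    return {
        name: tuple(tokens[start:end])
        for name, (start, end) in windows.items()
        if start >= 0 and end <= n
    }


if __name__ == "__main__":
    # Invented example query; "york" has no left trigram because it is the second token.
    for name, ngram in context_ngrams(["new", "york", "city", "hotels"], 1).items():
        print(name, ngram)
```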
To compute the most likely capitalization of a
token in a given n-gram based on Google’s Web
data, we aggregate the Google n-gram counts by
folding the case of all other tokens in the n-gram.
We find that at the bigram level, the capitalization of
literal tokens in our query set matches the most frequent
capitalization in the Google set 53.6% of the time
for left bigrams and 55.8% for right bigrams. The
matching improves to 59.4% for left trigrams, 64.8%
for middle trigrams, and 62.1% for right trigrams.
Nonetheless, these numbers are all significantly
lower than those obtained by hypothesizing that
every token is lowercase (mid to high 60s, in percent).
This indicates that users favour lowercase forms in
queries to a higher degree than Web-based n-gram
capitalization statistics predict.
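The case-folding aggregation described at the start of this paragraph can be sketched as below. The count table is a toy stand-in with invented numbers, not the Google n-gram data, and `aggregate_capitalizations` is a hypothetical helper name. A linear scan is used for brevity; a real count table would need an index keyed by the case-folded n-gram.

```python
from collections import Counter


def aggregate_capitalizations(ngram_counts, ngram, pos):
    """Sum counts per surface form of the token at `pos`.

    All n-grams that equal `ngram` after lowercasing are pooled, so the
    casing of the other tokens does not split the counts.
    """
    folded = tuple(t.lower() for t in ngram)
    totals = Counter()
    for candidate, count in ngram_counts.items():
        if tuple(t.lower() for t in candidate) == folded:
            totals[candidate[pos]] += count
    return totals


if __name__ == "__main__":
    # Invented counts for illustration only.
    toy_counts = {
        ("New", "York"): 50,
        ("new", "york"): 20,
        ("NEW", "YORK"): 5,
        ("New", "york"): 3,
        ("new", "jersey"): 9,
    }
    totals = aggregate_capitalizations(toy_counts, ("new", "york"), 1)
    print(totals.most_common())
```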
We now investigate whether capitalization
information may be useful for ranking, either as
submitted by users or as predicted based on Web
n-gram data. For the latter, we employ a system that
truecases each query token by using aggregate
capitalization counts for all trigrams that contain it,
with back-off to the bigrams, and finally to unigrams
when higher-order n-grams are undefined or
statistics for those n-grams are not available in the
Google data. Explicitly, for queries with only one
token, we choose the most frequent capitalization of
the token in the Google unigram data. For queries
with two tokens, the system predicts for each token
the most likely capitalization obtained through the
process of case folding of the other token and
aggregation described above. We back off to
unigram statistics when the bigram does not appear
in the Google data set. Similarly, for each token in
queries of length 3 or more, the system combines the
counts obtained using the case-folding and aggregation
process for each possible position of the token
in a trigram (left, middle, and right), with back-off to
bigrams and unigrams.
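A minimal sketch of such a back-off truecaser is given below, using a toy count table with invented numbers in place of the Google data. Two details are assumptions rather than statements from the text: at the bigram back-off step the left and right bigram counts are summed, and a token with no statistics at any order is left as typed. The names `form_counts` and `truecase` are hypothetical.

```python
from collections import Counter


def form_counts(counts, ngram, pos):
    """Counts per surface form of ngram[pos], pooling all casings of the n-gram."""
    folded = tuple(t.lower() for t in ngram)
    totals = Counter()
    for candidate, count in counts.items():
        if tuple(t.lower() for t in candidate) == folded:
            totals[candidate[pos]] += count
    return totals


def truecase(tokens, counts):
    """Truecase each token from trigrams, backing off to bigrams, then unigrams."""
    n = len(tokens)
    result = []
    for i, token in enumerate(tokens):
        chosen = token  # assumption: keep the token as typed if nothing is found
        for order in (3, 2, 1):
            totals = Counter()
            # Combine every window of this order that contains position i.
            for start in range(i - order + 1, i + 1):
                if start >= 0 and start + order <= n:
                    window = tokens[start:start + order]
                    totals += form_counts(counts, window, i - start)
            if totals:
                chosen = totals.most_common(1)[0][0]
                break  # statistics found at this order, no further back-off
        result.append(chosen)
    return result


if __name__ == "__main__":
    # Invented counts for illustration only.
    toy_counts = {
        ("New", "York", "Times"): 40,
        ("new", "york", "times"): 10,
        ("Paris",): 70,
        ("paris",): 20,
        ("hotels",): 90,
        ("Hotels",): 15,
    }
    print(truecase(["new", "york", "times"], toy_counts))
    print(truecase(["paris", "hotels"], toy_counts))
```

In the second example neither the trigram nor the bigram level yields counts, so both tokens are resolved from unigram counts.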
Table 1: Capitalization inter-agreement ratios at query
level (i.e., the capitalization of all tokens in a query
matches).

              Annotator 1   Annotator 2   Original   System
Annotator 1        -            80%          36%       54%
Annotator 2       80%            -           33%       48%
Original          36%           33%           -        28%
System            54%           48%          28%        -

Table 2: Capitalization inter-agreement ratios at query
token level.

              Annotator 1   Annotator 2   Original   System
Annotator 1        -           85.5%        61.7%     73.9%
Annotator 2      85.5%           -          54.5%     70.0%
Original         61.7%         54.5%          -       49.2%
System           73.9%         70.0%        49.2%       -

To estimate how well this truecasing system
works, we selected 100 queries at random from our
set (Appendix 1), stripped the case information, and
asked two annotators to truecase them according to
their best guess of the original query intent. Tables 1
and 2 summarize the inter-annotator agreement, as
well as the matching with the original capitalization
and the system-predicted capitalization, at query
level and token level, respectively. As expected,
percentages are much higher when agreement is
computed at token level, because two queries match
only if the capitalizations of all their component
tokens match.
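The difference between the two agreement levels can be illustrated with a short sketch. The function name `agreement` and the example queries are hypothetical, and the token-level ratio is computed here by pooling all tokens across queries, which is an assumption since the text does not state how token-level ratios are averaged.

```python
def agreement(casings_a, casings_b):
    """Return (query_level, token_level) agreement between two casings.

    Both arguments are lists of queries, each query a list of tokens, and
    the two lists must be aligned (same queries, same tokenization).
    """
    query_matches = 0
    token_matches = 0
    token_total = 0
    for query_a, query_b in zip(casings_a, casings_b):
        same = [a == b for a, b in zip(query_a, query_b)]
        token_matches += sum(same)
        token_total += len(same)
        # A query counts as matching only if every one of its tokens matches.
        query_matches += all(same)
    return query_matches / len(casings_a), token_matches / token_total


if __name__ == "__main__":
    # Invented casings for illustration only.
    first = [["New", "York", "hotels"], ["cheap", "flights"]]
    second = [["New", "York", "Hotels"], ["cheap", "flights"]]
    print(agreement(first, second))
```

A single mismatched token lowers the query-level ratio by a whole query but the token-level ratio by only one token, which is why the figures in Table 2 exceed those in Table 1.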
An important observation is that the truecasing
system based on the Google n-gram data agrees with
the annotators to a much higher degree than it
agrees with the original casing of the queries, and
also to a higher degree than the annotators’
capitalizations agree with the original
capitalization of the queries. We also note that this
system predicts a higher number of tokens as
starting in uppercase than the human annotators do
(64.9% and 78.5% of the token-level disagreements
with the two annotators are of this type), which
may indicate a Web bias towards capitalized forms.
Finally, we measure the correlation between
relevance and the matching of capitalization in
queries and documents. For every query and
document pair, we compute the percentage of time
the capitalization of tokens in the query matches the
capitalization forms of the tokens in the text of the
document, then we macro-average the obtained
values, first at query-document level, and then for all
KDIR 2010 - International Conference on Knowledge Discovery and Information Retrieval