
 
Web search engine users employ uppercase letters in 
their queries, their capitalization matches the most 
frequent form on the Web rather randomly. 
However, since word capitalization is highly 
dependent on the context in which a word is used, 
we must also employ higher order n-gram statistics. 
For each token $w_i$ in a query
$q = w_1 \ldots w_{i-2} w_{i-1} w_i w_{i+1} w_{i+2} \ldots w_n$,
we examine the statistics for all possible bigrams
(left: $w_{i-1} w_i$ and right: $w_i w_{i+1}$) and
trigrams (left: $w_{i-2} w_{i-1} w_i$, middle:
$w_{i-1} w_i w_{i+1}$, and right: $w_i w_{i+1} w_{i+2}$)
that contain it (obviously, some of these are undefined
for values of $n \le 2$ or $i \in \{0, 1, n-1, n\}$, and
thus, cannot be accounted for). 86.8% of the bigrams
and 59.8% of the trigrams present in our queries
appear in the corresponding Google n-gram sets with
at least one capitalization.
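The enumeration of n-grams around a token can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `ngrams_containing` and the 0-based index are assumptions made here (the text indexes tokens from 1), and n-grams that would run past either end of the query are simply omitted.

```python
def ngrams_containing(tokens, i):
    """Return the bigrams and trigrams of `tokens` that contain tokens[i].

    `i` is 0-based (an assumption of this sketch). N-grams that would
    extend beyond the query boundaries are undefined and left out.
    """
    n = len(tokens)
    # (label, start offset relative to i, n-gram length)
    windows = [
        ("left bigram", -1, 2),
        ("right bigram", 0, 2),
        ("left trigram", -2, 3),
        ("middle trigram", -1, 3),
        ("right trigram", 0, 3),
    ]
    found = {}
    for label, offset, size in windows:
        start = i + offset
        if start >= 0 and start + size <= n:
            found[label] = tuple(tokens[start:start + size])
    return found
```

For a one-token query the result is empty, which corresponds to the case where only unigram statistics are available.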
To compute the most likely capitalization of a 
token in a given n-gram based on Google’s Web 
data, we aggregate the Google n-gram counts by 
folding the case of all other tokens in the n-gram. 
We find that, at the bigram level, the capitalization
of literal tokens in our query set matches the most
frequent capitalization in the Google set 53.6% of the
time for left bigrams and 55.8% for right bigrams. The
matching improves to 59.4% for left trigrams, 64.8%
for middle trigrams, and 62.1% for right trigrams.
Nonetheless, these numbers are all significantly
lower than those obtained by hypothesizing that the
capitalization of all tokens is lowercase (mid to high
60s). This indicates that users favour lowercase
forms in queries to a higher degree than is predicted
by Web-based n-gram capitalization statistics.
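The case-folding aggregation described above can be illustrated with a short sketch. It assumes the n-gram counts are held in a plain dictionary keyed by tuples of cased tokens; the function names and this data layout are hypothetical, and the linear scan over the table is for clarity only.

```python
from collections import Counter


def capitalization_counts(ngram_counts, ngram, position):
    """Sum counts per surface form of the token at `position`.

    All entries that equal `ngram` after lowercasing are pooled, so the
    case of the other tokens in the n-gram is folded away.
    """
    target = tuple(t.lower() for t in ngram)
    totals = Counter()
    for cased, count in ngram_counts.items():
        if tuple(t.lower() for t in cased) == target:
            totals[cased[position]] += count
    return totals


def most_likely_form(ngram_counts, ngram, position):
    """Return the most frequent capitalization, or None if unseen."""
    totals = capitalization_counts(ngram_counts, ngram, position)
    if not totals:
        return None
    return totals.most_common(1)[0][0]


# Toy counts invented for illustration; they are not Google n-gram data.
toy_bigrams = {
    ("New", "York"): 900,
    ("new", "york"): 100,
    ("NEW", "YORK"): 50,
    ("New", "york"): 20,
    ("new", "York"): 5,
}
```

With these toy counts, the forms of the second token are pooled over both casings of the first token, so "York" collects the counts of ("New", "York") and ("new", "York").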
We now investigate whether capitalization 
information may be useful for ranking, either as 
submitted by users or as predicted based on Web n-
gram data. For the latter, we employ a system that 
truecases each query token by using aggregate 
capitalization counts for all trigrams that contain it, 
with back-off to the bigrams, and finally to unigrams 
when higher-order n-grams are undefined or 
statistics for those n-grams are not available in the 
Google data. Explicitly, for queries with only one 
token, we choose the most frequent capitalization of 
the token in the Google unigram data. For queries 
with two tokens, the system predicts for each token 
the most likely capitalization obtained through the 
process of case folding of the other token and 
aggregation described above. We back off to
unigram statistics when the bigram does not appear
in the Google data set. Similarly, for each token in
queries of length 3 or more, the system combines the
counts obtained using the case-folding and aggregation
process for each possible position of the token
in a trigram (left, middle, and right), with back-off to
bigrams and unigrams.
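A minimal sketch of such a back-off truecaser is given below. Several details are assumptions of this sketch rather than statements from the text: counts from the different positions are combined by simple summation, the count tables are dictionaries keyed by tuples of cased tokens (1-tuples for unigrams), tokens with no statistics at any order are left in lowercase, and the names `folded_counts` and `truecase` are hypothetical.

```python
from collections import Counter


def folded_counts(table, ngram, position):
    """Pool counts per form of ngram[position], folding the other tokens' case."""
    target = tuple(t.lower() for t in ngram)
    totals = Counter()
    for cased, count in table.items():
        if tuple(t.lower() for t in cased) == target:
            totals[cased[position]] += count
    return totals


def truecase(tokens, unigrams, bigrams, trigrams):
    """Predict a capitalization for each token with trigram > bigram > unigram back-off."""
    tokens = [t.lower() for t in tokens]
    n = len(tokens)
    result = []
    for i in range(n):
        chosen = None
        for size, table in ((3, trigrams), (2, bigrams), (1, unigrams)):
            totals = Counter()
            # Visit every window of `size` tokens that contains token i.
            for start in range(i - size + 1, i + 1):
                if start >= 0 and start + size <= n:
                    window = tokens[start:start + size]
                    totals += folded_counts(table, window, i - start)
            if totals:  # statistics found at this order: stop backing off
                chosen = totals.most_common(1)[0][0]
                break
        # Assumption: keep the lowercase form when nothing is known.
        result.append(chosen if chosen is not None else tokens[i])
    return result


# Toy tables invented for illustration; they are not Google n-gram data.
toy_unigrams = {("paris",): 40, ("Paris",): 60, ("hotels",): 90,
                ("Hotels",): 10, ("cheap",): 80, ("Cheap",): 20}
toy_bigrams = {("Paris", "hotels"): 30, ("paris", "hotels"): 10}
toy_trigrams = {}
```

Scanning the whole table for each window is inefficient; an index keyed by the lowercased n-gram would be the natural choice for data of realistic size.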
Table 1: Capitalization inter-agreement ratios at query
level (i.e., the capitalization of all tokens in a query
matches).

              Annotator 1   Annotator 2   Original   System
Annotator 1        -            80%          36%       54%
Annotator 2       80%            -           33%       48%
Original          36%           33%           -        28%
System            54%           48%          28%        -

Table 2: Capitalization inter-agreement ratios at query
token level.

              Annotator 1   Annotator 2   Original   System
Annotator 1        -           85.5%        61.7%     73.9%
Annotator 2      85.5%           -          54.5%     70.0%
Original         61.7%         54.5%          -       49.2%
System           73.9%         70.0%        49.2%       -

To estimate how well this truecasing system
works, we selected 100 queries at random from our
set (Appendix 1), stripped the case information, and
asked two annotators to truecase them according to
their best guess of the original query intent. Tables 1
and 2 summarize the inter-annotator agreement, as
well as the matching with the original capitalization
and the system-predicted capitalization, at query
level and token level, respectively. As expected,
percentages are much higher when agreement is
computed at token level, because two queries match
only if the capitalizations of all their component
tokens match.
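The difference between the two agreement measures can be made concrete with a small sketch. It assumes whitespace tokenization and that the token-level ratio is computed over all tokens pooled across queries; neither detail is stated in the text, and the function name `agreement` is hypothetical.

```python
def agreement(casings_a, casings_b):
    """Return (query_level, token_level) agreement between two casings.

    Both arguments are lists of the same queries, differing only in case.
    A query counts as matching only if every one of its tokens matches.
    """
    query_matches = 0
    token_matches = 0
    token_total = 0
    for query_a, query_b in zip(casings_a, casings_b):
        tokens_a, tokens_b = query_a.split(), query_b.split()
        if len(tokens_a) != len(tokens_b):
            raise ValueError("casings must share the same tokenization")
        matches = sum(a == b for a, b in zip(tokens_a, tokens_b))
        token_matches += matches
        token_total += len(tokens_a)
        query_matches += matches == len(tokens_a)
    return query_matches / len(casings_a), token_matches / token_total
```

A single mismatched token makes the whole query count as a disagreement while still contributing its matching tokens to the token-level ratio, which is why the token-level figures are never lower than the query-level ones.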
An important observation is that the truecasing
system based on the Google n-gram data agrees with
the annotators to a much higher degree than it
agrees with the original casing of the queries, and
also to a higher degree than the annotators'
capitalizations agree with the original
capitalization of the queries. We also note that this
system predicts a higher number of tokens as
starting in uppercase than the human annotators
(64.9% and 78.5% of the disagreements with the two
annotators at token level are of this type), which
may indicate a Web bias towards capitalized forms.
Finally, we measure the correlation between 
relevance and the matching of capitalization in 
queries and documents. For every query and 
document pair, we compute the percentage of time 
the capitalization of tokens in the query matches the 
capitalization forms of the tokens in the text of the 
document, then we macro-average the obtained 
values, first at query-document level, and then for all 
KDIR 2010 - International Conference on Knowledge Discovery and Information Retrieval