detailed description of our location alignment
procedure. For readers not specifically interested in
implementing similar experiments, a skim of this
section will suffice.
As noted in the previous section, the ground-
truth and query-analysis results are built on two
separate namespaces. To try to bring the two
together, we passed each city-to-city name through a
geo-coder supplied by http://maps.google.com. Each
location was translated into latitude and longitude
coordinates. We then reverse geo-coded each of
these 12,873 latitude/longitude coordinates, mapping
each to the city-level “localities” used by
Google.com. These forward-reverse geo-coding
operations resulted in many-to-one assignments
from the 12,873 city-to-city names to Google
localities. While all of these mapped into the Cities_G
set (for which we have query-log data), they mapped
onto only 8093 distinct Google localities. Many-to-one
mappings typically occurred when counties from the
city-to-city lists mapped to the same locality as one of
the also-referenced cities within the county. We also
saw within-city neighborhood names from the
city-to-city lists mapping onto the same city-level
locality.
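A minimal sketch of this forward-reverse pass follows, assuming hypothetical geocode() and reverse_geocode() helpers that wrap the http://maps.google.com service; the names, signatures, and return types are illustrative, not the actual API.

    from collections import defaultdict

    def align_names(city_to_city_names, geocode, reverse_geocode):
        """Forward geo-code each city-to-city name to lat/lon, then reverse
        geo-code to a city-level locality; record the many-to-one collisions."""
        name_to_locality = {}
        locality_to_names = defaultdict(list)
        for name in city_to_city_names:
            lat, lon = geocode(name)              # hypothetical forward geo-coder
            locality = reverse_geocode(lat, lon)  # hypothetical reverse geo-coder
            name_to_locality[name] = locality
            locality_to_names[locality].append(name)
        return name_to_locality, locality_to_names

On the data described above, locality_to_names would collapse the 12,873 names onto 8093 keys, with the county and neighborhood collisions appearing as multi-entry lists.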
To handle this ambiguity, we first consider how
we will evaluate the quality of our similarity results.
In Section 6, we provide a variety of analysis tools.
In all cases, we cycle through a list of target cities
and, for each, compare our similarity results (given
by the query-log analysis) against the ground truth
(given by the city-to-city lists).
We define the “target city” as the city-to-city
location relative to which we want to measure or rank
all the known cities. In order to decide which Google
locality to use as the target city, we simply use the
forward/reverse geo-code mapping described above.
This gives us a pair of vectors to compare: one from
the city-to-city namespace, containing sparse
mappings over the set of 12,872 other city-to-city
names, and the other from the Google locality
namespace, giving a dense mapping to 8093 other
Google localities that are their geo-coding-based
“partners.” Note that, based on the list occurrence
requirements mentioned in the previous subsection,
there are 8123 target cities, which then map onto
only 4478 Google localities. Each of the 8123 targets
nonetheless remains a distinct case, since every target
is evaluated as a pair of vectors and the city-to-city
sparse mappings differ for each of the 8123 names,
even when the query-analysis mappings are repeated
because of the many-to-one nature of the association.
For each of these target cases, we need to
compare a sparse association mapping into the
12,872 city-to-city namespace with a dense mapping
into the 8093 Google query-stream localities. For
many Google localities, there is only one city-to-city
name with a default mapping onto that Google
locality name. We fix these mappings as our first
step. For the remaining Google localities, where the
association to a city-to-city name is ambiguous, we
use an optimistic mapping onto city-to-city names
that have not already been used. The greedy
mapping starts from the most strongly associated
Google locality (in the ambiguous set) and picks,
from the city-to-city names within the 30 km radius,
the most strongly associated city-to-city name that
has not already been used.
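A sketch of this two-step assignment is given below, under stated assumptions: association(locality, name) stands in for our query-log association strength, and candidates_within_30km(locality) returns the city-to-city names whose geo-coded coordinates fall within the 30 km radius; both are hypothetical helpers, not the literal implementation.

    def assign_localities(locality_to_names, association, candidates_within_30km):
        """Resolve the many-to-one geo-coding collisions.
        Step 1: localities with exactly one city-to-city name are fixed first.
        Step 2: the remaining (ambiguous) localities are handled greedily,
        strongest association first, each taking the most strongly associated
        unused city-to-city name within 30 km."""
        assignment, used = {}, set()

        # Step 1: fix the unambiguous default mappings.
        for locality, names in locality_to_names.items():
            if len(names) == 1:
                assignment[locality] = names[0]
                used.add(names[0])

        ambiguous = [loc for loc, names in locality_to_names.items() if len(names) > 1]

        def strongest(loc):
            # Strength of a locality's best candidate, used to order the greedy pass.
            return max((association(loc, n) for n in candidates_within_30km(loc)),
                       default=float("-inf"))

        # Step 2: greedy pass over the ambiguous localities.
        for locality in sorted(ambiguous, key=strongest, reverse=True):
            candidates = [n for n in candidates_within_30km(locality) if n not in used]
            if not candidates:
                continue  # no unused name remains in range; left unassigned in this sketch
            best = max(candidates, key=lambda n: association(locality, n))
            assignment[locality] = best
            used.add(best)
        return assignment

Because the strongest associations are assigned first, ties are resolved in favor of the pairings most likely to be correct, matching the optimistic character of the mapping described above.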
This mapping is used in all evaluations,
including the baseline orderings by geographic
distance, total population, and population difference
(Section 6). Despite this bias toward closer alignments,
we avoid overstating results in any direction by looking
at relative performance, where all alternatives share the
same optimistic advantages.
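For concreteness, one way the resulting mapping can be applied when scoring a single target is sketched below; the actual comparison metrics are deferred to Section 6, and the dictionary shapes are assumptions for illustration only.

    def vectors_for_target(ground_truth_sparse, query_similarity_dense, assignment):
        """Bring both similarity vectors for one target into the city-to-city
        namespace so they can be compared.
        ground_truth_sparse:    {city_to_city_name: ground-truth association}
        query_similarity_dense: {google_locality: log-based similarity}
        assignment:             google_locality -> city_to_city_name (greedy mapping)"""
        mapped_dense = {assignment[loc]: score
                        for loc, score in query_similarity_dense.items()
                        if loc in assignment}
        return ground_truth_sparse, mapped_dense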
3 FEATURE SPACE
One of the difficulties in comparing queries, even
after standard normalization steps are taken, is that
queries that initially appear far apart in terms of
spelling and edit distance can represent the
same concept. For example, the terms “auto” and
“cars” are often used interchangeably, as are “coke”
and “pop” or “mobile” and “phone.” To treat these
sets of queries as similar, we replace each query
with a concept cluster.
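As a sketch of this query-to-cluster substitution, assume a hypothetical top_clusters(query, k) helper that stands in for running a query through the concept model described below and reading off its most highly activated nodes; both the helper and the choice of k are illustrative.

    def queries_to_concepts(queries, top_clusters, k=3):
        """Replace each raw query with the IDs of the concept clusters it
        activates most strongly, so that surface variants collapse together.
        top_clusters is a hypothetical stand-in for the cluster lookup."""
        return {q: frozenset(top_clusters(q, k)) for q in queries}

    # With such a lookup, "auto repair" and "car repair" would be expected to
    # map to overlapping (often identical) cluster sets, even though the raw
    # strings are far apart in edit distance.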
Concept clusters are based on a large-scale
Bayesian network model of text, as detailed in
(Datta, 2005; Harik and Shazeer, 2004). Datta
describes the creation of PHIL (probabilistic
hierarchical inferential learner). Although a full
explanation of the PHIL system is beyond the scope
of this paper, a cursory overview is provided here.
PHIL is a top-down weighted directed acyclic graph
in which the top node represents “all concepts” and
the leaf nodes represent individual words or
compound-word tokens. The intermediate nodes are
created automatically, learned from word
co-occurrence statistics over large text corpora, and
each contains many conceptually similar words.
PHIL was originally
used as a generative model of text. For our
purposes, each query is used as input to the system,
and the intermediate nodes that are most highly
activated are assigned to the query. Queries about
similar concepts will activate similar nodes. Interestingly, this