3. Insert the Unseen set into the adjacency matrix: each unseen row is seeded with the other ngrams of Doc(i), with all numerators and denominators set to 1;
4. For each ngram in the Seen set, append the Unseen set to its row, initializing each new numerator = denominator = 1;
5. For each ngram in the Seen set, update the pre-existing entries of its row with denominator++; and if the existing word also occurs in Doc(i), then numerator++.
6. In order to guarantee strictly linear behaviour with respect to Jaccard updates, we reuse the incremental update regime detailed in the context of DSI updates in Section 3.3.1. Namely, we maintain at most K' candidates in each word's adjacency list; these are used to "bubble up" the final top-K relationships for each word, with K' > K. A sketch of steps 3-6 follows this list.
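The following is a minimal sketch of one plausible reading of steps 3-6; the names update_adjacency, adj, doc_ngrams, and k_prime are our illustrative assumptions, not the implementation used in this work.

# Sketch of the incremental adjacency update (steps 3-6).
# adj maps each ngram to {neighbour: [numerator, denominator]}, so that
# Jaccard(g, h) is approximated by numerator / denominator.

def update_adjacency(adj, doc_ngrams, k_prime):
    seen = {g for g in doc_ngrams if g in adj}
    unseen = {g for g in doc_ngrams if g not in adj}

    # Step 3: unseen rows are seeded with the other ngrams of Doc(i),
    # all numerators and denominators set to 1.
    for g in unseen:
        adj[g] = {h: [1, 1] for h in doc_ngrams if h != g}

    # Step 4: append the unseen ngrams to every seen row, counts 1/1.
    for g in seen:
        for h in unseen:
            adj[g][h] = [1, 1]

    # Step 5: for the pre-existing entries of each seen row, bump the
    # denominator; bump the numerator too if the neighbour is in Doc(i).
    for g in seen:
        for h, counts in adj[g].items():
            if h in unseen:
                continue  # initialised in step 4
            counts[1] += 1
            if h in doc_ngrams:
                counts[0] += 1

    # Step 6: keep at most k_prime candidates per row, ranked by the
    # current Jaccard estimate; the top-K relationships bubble up later.
    for g in doc_ngrams:
        if len(adj[g]) > k_prime:
            best = sorted(adj[g].items(),
                          key=lambda kv: kv[1][0] / kv[1][1],
                          reverse=True)[:k_prime]
            adj[g] = dict(best)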
Step S7 is actually an optimization/pre-
processing step to speed up S8, and so we first
describe S8, which is the transitive closure over the
graph obtained from S6.
At the output of S6, we have a pairwise-connected weighted graph represented by the adjacency matrix. From this it is possible to compute connected components, semantic clusters, and neighbourhoods; indeed, that is what we do in S7.
However, in order to capture the latent semantic
relationships, it is necessary to consider transitive
relationships between ngram nodes. If we fail to do
that, the semantic inference available from the entire
process tends to be locked into “vocabulary silos”.
We compute the transitive closure over this graph using the classic Floyd-Warshall algorithm, defining the relaxation step as the maximum of the direct edge weight between vertices i and j and the product through an intermediate vertex k: e(i, j) = Max{e(i, j), e(i, k) x e(k, j)}. This exposes the latent
semantic relationships, and gives us functional
equivalence with LSI. The end result is an entity
relatedness graph with edge weights representing
relationship strengths between any pair of vertices.
Note that our edge weights are symmetrical, and so
we only need to deal with the upper triangular half
of the adjacency matrix, discounting diagonal entries
(each node is trivially related to itself).
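As a concrete sketch, the max-product relaxation can be written as a standard Floyd-Warshall loop; the NumPy formulation below is our illustration, not the paper's implementation.

import numpy as np

def transitive_closure(w):
    # w: symmetric N x N array of Jaccard edge weights in [0, 1].
    # Relaxation: e(i, j) = max(e(i, j), e(i, k) * e(k, j)) for each k.
    w = w.copy()
    for k in range(w.shape[0]):
        w = np.maximum(w, np.outer(w[:, k], w[k, :]))
    np.fill_diagonal(w, 0.0)  # diagonal entries are discounted
    return w

Because the weights lie in [0, 1] and multiplication is monotone, (max, x) behaves as a path algebra, so the usual Floyd-Warshall correctness argument carries over; symmetry also means only the upper triangular half needs to be stored in a space-conscious implementation.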
Theorem: Given a graph G, defining the relaxation step on an edge connecting vertices Vi and Vj as e(i, j) = Max{e(i, j), e(i, k) x e(k, j)}, where the base edge weights represent Jaccard similarity, is necessary and sufficient to induce transitive closure over G, exposing latent semantic relationships.
Proof: If Jaccard similarity is considered as an approximation of the probability that nodes Vi and Vj are related, and if the relationships between any two (non-identical) pairs of vertices (e.g. {Vi, Vj} and {Vi, Vk}) are independent, then their joint probability is the product of the individual probabilities. But this is the same as multiplying edge weights stemming from Jaccard similarity.
To complete the running example, here are the outputs of the Transitive Closure step. The latent relationships that emerge only after transitive closure are the entries with weights 0.165 and 0.25.
cat: {(cheese, 0.33), (mouse, 0.33), (cat cheese,
0.5), (cat mouse, 0.5), (mouse cheese, 0.165)}
cheese: {(cat, 0.33), (mouse, 0.33), (cat cheese,
0.5), (mouse cheese, 0.5), (cat mouse, 0.165)}
mouse: {(cat, 0.33), (cheese, 0.33), (mouse
cheese, 0.5), (cat mouse, 0.5), (cat cheese,
0.165)}
cat cheese: {(cat, 0.5), (cheese, 0.5), (mouse,
0.165), (mouse cheese, 0.25), (cat mouse,
0.25)}
mouse cheese: {(cheese, 0.5), (mouse, 0.5),
(cat, 0.165), (cat cheese, 0.25), (cat mouse,
0.25)}
cat mouse: {(cat, 0.5), (mouse, 0.5), (cheese, 0.165), (cat cheese, 0.25), (mouse cheese, 0.25)}
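For instance, the pair (cat, mouse cheese) has no direct edge in the example; its weight arises from a length-two max-product path: e(cat, mouse cheese) = e(cat, cheese) x e(cheese, mouse cheese) = 0.33 x 0.5 = 0.165. Likewise, e(cat cheese, mouse cheese) = 0.5 x 0.5 = 0.25 via the shared unigram cheese.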
3.4.3 Transitive Closure Optimizations
Transitive Closure on a graph is an expensive O(N³) operation. We outline a series of optimizations that reduce the end-to-end complexity.
First, we enforce sparseness on the graph from S6 by clamping weights e(i, j) that are less than a sparseness threshold T to zero: if e(i, j) < T, then e(i, j) = 0.
We next run a connected-components algorithm on the sparse graph. Connected components is O(N). Once the M components are identified, transitive closure on the entire graph reduces to running transitive closure on each of the M components. Thus, we transform the overall complexity from O(N³) to O(Σ nᵢ³), summed over the M components, where nᵢ is the number of vertices in component i.
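A sketch of this pipeline, assuming SciPy's connected-components routine (the paper does not prescribe a particular library):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def closure_by_components(w, T):
    # Clamp sub-threshold weights to zero (sparseness threshold T).
    w = np.where(w < T, 0.0, w)
    # Identify the M components of the sparsified graph.
    n_comp, labels = connected_components(csr_matrix(w), directed=False)
    # Close each component independently: O(sum of n_i^3), not O(N^3).
    for c in range(n_comp):
        idx = np.flatnonzero(labels == c)
        if idx.size > 1:
            sub = w[np.ix_(idx, idx)]
            for k in range(sub.shape[0]):
                sub = np.maximum(sub, np.outer(sub[:, k], sub[k, :]))
            w[np.ix_(idx, idx)] = sub
    return w

Each component is closed in place, so the cubic work is bounded by the size of the largest component rather than by N.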
A spin-off data structure of connected
components is the Component Index, which is
actually a hierarchy of embedded sub-components,
corresponding to different values of sparseness
threshold. This enables us to work with components
and clusters later on.
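As an illustrative sketch (the threshold schedule and the dictionary layout here are our assumptions), the Component Index can be built by recomputing components at successively tighter thresholds, each labelling refining the previous one:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def component_index(w, thresholds=(0.1, 0.2, 0.4)):
    # Map each sparseness threshold to a component labelling; components
    # at a tighter threshold are embedded in those of a looser one.
    index = {}
    for T in sorted(thresholds):  # loosest first
        wt = np.where(w < T, 0.0, w)
        _, labels = connected_components(csr_matrix(wt), directed=False)
        index[T] = labels
    return index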
We now state, without formal proof, a conjecture based on substantial empirical evidence that completely eliminates the expensive transitive closure operation for large corpora. Either transitive closure is necessary and the document set is small, in which case it is cheap