SIMPLEX DECOMPOSITIONS USING SVD AND PLSA
Madhusudana Shashanka and Michael Giering
United Technologies Research Center, East Hartford, CT 06108, U.S.A.
Keywords: Matrix factorization, Probabilistic Latent Semantic Analysis (PLSA), Nonnegative Matrix Factorization
(NMF), Singular Value Decomposition (SVD), Principal Components Analysis (PCA).
Abstract:
Probabilistic Latent Semantic Analysis (PLSA) is a popular technique to analyze non-negative data where
multinomial distributions underlying every data vector are expressed as linear combinations of a set of basis
distributions. These learned basis distributions that characterize the dataset lie on the standard simplex and
themselves represent corners of a simplex within which all data approximations lie. In this paper, we describe
a novel method to extend the PLSA decomposition where the bases are not constrained to lie on the standard
simplex and thus are better able to characterize the data. The locations of PLSA basis distributions on the
standard simplex depend on how the dataset is aligned with respect to the standard simplex. If the directions
of maximum variance of the dataset are orthogonal to the standard simplex, then the PLSA bases will give
a poor representation of the dataset. Our approach overcomes this drawback by utilizing Singular Value
Decomposition (SVD) to identify the directions of maximum variance, and transforming the dataset to align
these directions parallel to the standard simplex before performing PLSA. The learned PLSA features are
then transformed back into the data space. The effectiveness of the proposed approach is demonstrated with
experiments on synthetic data.
1 INTRODUCTION
The need for analyzing non-negative data arises in
several applications such as computer vision, seman-
tic analysis and gene expression analysis among oth-
ers. Nonnegative Matrix Factorization (NMF) (Lee
and Seung, 1999; Lee and Seung, 2001) was specifi-
cally proposed to analyze such data where every data
vector is expressed as a linear combination of a set of
characteristic basis vectors. The weights with which
these vectors combine differ from data point to data
point. All entries of the basis vectors and the weights
are constrained to be nonnegative. The nonnegativity
constraint produces basis vectors that can only com-
bine additively without any cross-cancellations and
thus can be intuitively thought of as building blocks
of the dataset. Given these desirable properties, the
technique has found wide use across different appli-
cations. However, one of the main drawbacks of NMF
is that the energies of the data vectors are split between the basis vectors and the mixture weights during decomposition. In other words, the basis vectors may lie in an entirely different part of the data space, making any geometric interpretation meaningless.
Probabilistic Latent Semantic Analysis (PLSA)
(Hofmann, 2001) is a related method with probabilis-
tic foundations which was proposed around the same
time in the context of semantic analysis of document
corpora. A corpus of documents is represented as
a matrix where each column vector corresponds to
a document and each row corresponds to a word in
the vocabulary and each entry corresponds to the number of times the word appeared in the document.
PLSA decomposes this matrix as a linear combination
of a set of multinomial distributions over the words
called topics where the weight vectors are multino-
mial distributions as well. The non-negativity constraint
is imposed implicitly because the extracted topics or
basis distributions and weights represent probabili-
ties. It has been shown that the underlying compu-
tations in NMF and PLSA are identical (Gaussier and
Goutte, 2005; Shashanka et al., 2008). However, un-
like NMF where there are no additional constraints
beyond nonnegativity, PLSA bases and weights, being multinomial distributions, also have the constraint that their entries sum to 1. Since the weights sum to 1, the PLSA approximations of the data can be thought of as lying within a simplex defined by the basis distributions. (Shashanka, 2009) formalizes this geometric intuition as Simplex Decompositions, where the model
extracts basis vectors that combine additively and cor-
respond to the corners of a simplex surrounding the
modeled data. PLSA and its extensions such as Latent
Dirichlet Allocation (Blei et al., 2003) and Correlated
Topic Models (Blei and Lafferty, 2006) are specific
examples of Simplex Decompositions.
Since PLSA (and other PLSA extensions) does
not decompose the data-vectors themselves but the
underlying multinomial distributions (i.e. the data
vectors normalized to sum to unity), the extracted ba-
sis vectors don’t lie in the data space but lie on the
standard simplex. This can be a drawback depend-
ing on the dataset under consideration and may pose
a particular problem if the data is aligned such that
most of the variability and structure characterizing the
dataset lies in directions orthogonal to the standard
simplex. In such cases, the projections of the data
vectors onto the simplex (which is what is decom-
posed by PLSA) carry very little information about
the shape of the data distribution and thus the obtained
PLSA bases are much less informative.
In this paper, we propose an approach to get
around this drawback of PLSA. We first use Singu-
lar Value Decomposition (SVD) to identify the di-
rections of the most variability in the dataset and then
transform the dataset so that these vectors are paral-
lel to the standard simplex. We perform PLSA on the
transformed data and obtain PLSA basis vectors in the
transformed space. Since the transformation is affine
and invertible, we apply the inverse transformation on
the basis vectors to obtain basis vectors that character-
ize the data in the original data space. These basis
vectors are no longer constrained to lie on the standard simplex but lie within the data space and corre-
spond to corners of a simplex that surrounds all the
data points.
The paper is organized as follows. In Section 2,
we provide the necessary background by describing
the PLSA algorithm and geometry. Section 3 de-
scribes our proposed approach and constitutes the
bulk of the paper. We illustrate the applicability of the
method by applying the proposed technique on syn-
thetic data. We also provide a short discussion of the
algorithm and its applicability for semi-nonnegative
factorizations. We conclude the paper in Section 4
with a brief summary and avenues for future work.
2 BACKGROUND
Consider an M × N non-negative data matrix V where each column $v_n$ represents the n-th data vector and $v_{mn}$ represents the (m,n)-th element. Let $\bar{v}_n$ represent the normalized vector $v_n$ and let $\bar{V}$ be the matrix V with all columns normalized.

Figure 1: Illustration of Probabilistic Latent Semantic Analysis. The data matrix V with 1000 3-dimensional vectors $v_n$ is shown as red points and the normalized data $\bar{V}$ is shown as blue points on the standard simplex. PLSA was performed on V and the three extracted basis distributions $w_1$, $w_2$ and $w_3$ are points on the standard simplex that form the corners of the PLSA simplex around the normalized data points $\bar{v}_n$ shown in blue.
PLSA characterizes the bidimensional distribution P(m,n) underlying V as

$$P(m,n) = P(n)P(m|n) = P(n)\sum_z P(m|z)P(z|n), \qquad (1)$$

where z is a latent variable. PLSA represents $\bar{v}_n$ as a data distribution P(m|n) which in turn is expressed as a linear combination of basis distributions P(m|z). These basis distributions combine with different proportions given by P(z|n) to form the data distributions.

The PLSA parameters P(m|z) and P(z|n) can be estimated through iterations of the following equations derived using the EM algorithm:

$$P(z|m,n) = \frac{P(m|z)P(z|n)}{\sum_z P(m|z)P(z|n)},$$

$$P(m|z) = \frac{\sum_n v_{mn} P(z|m,n)}{\sum_m \sum_n v_{mn} P(z|m,n)}, \quad \text{and}$$

$$P(z|n) = \frac{\sum_m v_{mn} P(z|m,n)}{\sum_m v_{mn}}.$$

The EM algorithm guarantees that the above updates converge to a local optimum.
PLSA can be written as a matrix factorization

$$\bar{V}_{M\times N} \approx W_{M\times Z} H_{Z\times N} = P_{M\times N}, \qquad (2)$$

where W is the matrix of basis distributions P(m|z) with column $w_z$ corresponding to the z-th basis distribution, H is the mixture-weight distribution matrix of entries P(z|n) with column $h_n$ corresponding to the n-th data vector, and P is the matrix of model approximations P(m|n) with column $p_n$ corresponding to the n-th data vector. See Figure 1 for an illustration of PLSA.
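For concreteness, the following is a minimal numpy sketch of the EM updates above in their equivalent multiplicative matrix form. The function name, iteration count, and random initialization are our own choices and are not specified in the paper.

```python
import numpy as np

def plsa(V, Z, n_iter=200, eps=1e-12, seed=0):
    """Minimal PLSA sketch: V is an (M, N) nonnegative matrix, Z the number of
    basis distributions. Returns W (M, Z) with columns P(m|z) and H (Z, N) with
    columns P(z|n), so that the column-normalized data is approximated by W @ H."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, Z))
    W /= W.sum(axis=0, keepdims=True)                  # P(m|z): columns sum to 1
    H = rng.random((Z, N))
    H /= H.sum(axis=0, keepdims=True)                  # P(z|n): columns sum to 1
    col_sums = V.sum(axis=0, keepdims=True) + eps      # sum_m v_mn
    for _ in range(n_iter):
        R = V / (W @ H + eps)                          # ratio v_mn / P(m|n); carries P(z|m,n)
        W_new = W * (R @ H.T)                          # numerator sum_n v_mn P(z|m,n)
        W_new /= W_new.sum(axis=0, keepdims=True)
        H_new = H * (W.T @ R)                          # numerator sum_m v_mn P(z|m,n)
        H_new /= col_sums                              # divide by sum_m v_mn
        W, H = W_new, H_new
    return W, H
```

Each column of W and of H sums to 1, so W @ H approximates the column-normalized data, matching equation (2).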
3 ALGORITHM
The previous section described the PLSA algorithm and
illustrated the geometry of the technique. This sec-
tion presents our proposed approach. We first briefly
present the motivation for our algorithm and then de-
scribe the details of the algorithm. We illustrate the
algorithm by applying it on a synthetic dataset.
3.1 Motivation
As illustrated in Figure 1, the basis distributions ob-
tained by applying PLSA on a dataset lie on the Stan-
dard simplex. The basis distributions form the corners
of a PLSA Simplex containing not the original data-
points but the normalized datapoints instead.
Our goal is to extend the technique so that the ba-
sis vectors form a simplex around the original data-
points. In other words, we would like to remove the
constraint that the basis vectors form multinomial dis-
tributions and thus they don’t have to lie on the stan-
dard simplex. However, since we need the basis vec-
tors to still form a simplex around the data approxima-
tions, the mixture weights with which they combine
are still constrained to be multinomial distributions.
The necessity of such an approach becomes appar-
ent when one considers the implication of normaliza-
tion of datapoints that PLSA implicitly does. The nor-
malization skews the relative geometry of datapoints.
In certain cases, the normalization can hide the real
shape of the distribution of datapoints, as illustrated in
Figure 2.
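The following small numpy sketch mirrors the situation of Figure 2 under our own assumptions about the synthetic data (the paper does not specify how its dataset was generated): the dominant variation is placed along the direction orthogonal to the simplex plane, and normalizing the points removes it. Here rows are data points, unlike the column convention used for V.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
normal = np.ones(3) / np.sqrt(3.0)             # direction orthogonal to the simplex plane
center = np.ones(3)                            # keep points inside the positive octant

t = rng.normal(scale=0.5, size=N)              # large spread along the simplex normal
noise = rng.normal(scale=0.05, size=(N, 3))
noise -= np.outer(noise @ normal, normal)      # keep only the in-plane component
X = center + np.outer(t, normal) + noise       # rows are 3-dimensional data points
X = np.clip(X, 1e-6, None)                     # guard against rare negative entries

X_bar = X / X.sum(axis=1, keepdims=True)       # the normalization PLSA implicitly performs

print(np.linalg.svd(X - X.mean(0), compute_uv=False))        # one dominant singular value
print(np.linalg.svd(X_bar - X_bar.mean(0), compute_uv=False))  # that direction has vanished
```

Since every normalized point sums to exactly 1, the centered normalized data has zero variance along the simplex normal, which is precisely the variation that characterized the original points.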
3.2 Problem Formulation
Given the data matrix V, we would like to find a matrix decomposition similar to equation (2) of the form

$$V_{M\times N} \approx W'_{M\times Z} H'_{Z\times N} = P'_{M\times N}, \qquad (3)$$

where Z is the dimensionality of the desired decomposition, $W'$ is the matrix of basis vectors, $H'$ is the matrix of mixture weights, and $P'$ is the matrix of approximations.
The above equation is similar to equation (2), but with important differences. In equation (2), the matrix undergoing decomposition is $\bar{V}$, whereas the goal here is to decompose the original data matrix V.

Figure 2: Illustration of normalization on a dataset. Points in red represent a dataset of 1000 3-dimensional points where the directions of maximum variance are orthogonal to the plane corresponding to the standard simplex. Thus, the projection of the points in the dataset onto the standard simplex removes important information about the distribution of datapoints.

The matrix $W'$ is analogous to W from equation (2), but unlike the columns of W, which are constrained to sum to 1, the columns of $W'$ have no such constraint. Similarly, $P'$ is analogous to P, but the columns of the former are not constrained to sum to 1 like the columns of P. However, since both equations (2) and (3) are simplex decompositions, the matrices H and $H'$ are alike, with the entries in each of their columns constrained to sum to 1.
3.3 Algorithm
Consider a scenario where a dataset V that we desire
to decompose using PLSA already lies on the stan-
dard simplex. Then, all the constraints that we need as
described in the previous subsection are already satis-
fied. Since all data points lie on the standard simplex,
the dataset V is identical to its normalized version $\bar{V}$.
Hence, the decomposition desired in equation (3) be-
comes identical to the decomposition in equation (2).
We can apply PLSA directly to the given dataset V
and obtain the desired basis vectors.
This observation points to the approach we present below. If we could transform the dataset so that all points lie on the standard simplex and the transformation is invertible, we could achieve the desired decomposition. However, the standard simplex in M-dimensional space is part of an (M−1)-dimensional hyperplane. Thus, instead of being able to have the points lie exactly on the standard simplex, we are constrained to transforming the data such that the projections of the data onto (M−1) dimensions of our choice lie on the simplex. Choosing the first (M−1) principal components of the dataset as the (M−1) dimensions onto which the data are projected produces the least error of all possible projections.

Figure 3: Results of applying PLSA on the dataset shown in Figure 2. Since the projections of the data points on the standard simplex (shown in blue) have very narrow variance in one direction, the PLSA simplex obtained is degenerate and almost forms a straight line through the data projections.
The problem now reduces to finding the right
transformation that takes the projections of the data
on the first (M−1) principal components and aligns them parallel to the standard simplex. The last principal component is transformed such that it is orthogonal to the standard simplex. We leverage the work of
(Shashanka, 2009) to define this transformation ma-
trix. More details can be found in the Appendix.
Given the M × N data matrix V, the entire algorithm can be summarized as follows:

1. Center the data by removing the mean vector to obtain $\hat{V}$, i.e. $\hat{V} = V - \text{mean}(V)$.

2. Perform SVD of the matrix $\hat{V}^T$ to obtain U, the matrix of data projections on the singular vectors, i.e. $\hat{V}^T = USX^T$.

3. Obtain the M × M transformation matrix T (see the Appendix for details of this computation).

4. Transform the data to lie parallel to the standard simplex, i.e. $B = (UT^T)^T$.

5. Center the transformed data such that the centroid of the simplex coincides with the data mean, i.e. $\bar{B} = B - \text{mean}(B) + c$, where c is a vector corresponding to the centroid of the standard simplex.

6. Ensure all entries of $\bar{B}$ are nonnegative by subtracting the minimum entry from the matrix, i.e. $\hat{B} = \bar{B} - \min(\bar{B})$.

7. Normalize the matrix $\hat{B}$ such that the entries of the center of the dataset sum to 1, i.e. $B^* = \hat{B}/b$, where $b = 1 - \min(\bar{B})$.

8. The matrix is now ready for PLSA. Apply PLSA on $B^*$ to obtain $W^*$ and $H'$, i.e. $B^* \approx W^*H'$.

9. Undo steps 7, 6, 5 and 4 respectively for the basis vector matrix $W^*$ to obtain $\bar{W}$, i.e.
   $W = W^* \times b$,
   $W = W + \min(\bar{B})$,
   $W = W + \text{mean}(B) - c$,
   $\bar{W} = W^T T$.

10. Undo the SVD projection and data centering for $\bar{W}$ to obtain $W'$, i.e.
    $\bar{W} = (\bar{W} S X^T)^T$,
    $W' = \bar{W} + \text{mean}(V)$.

The desired decomposition is given by $V \approx W'H'$.
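The steps above translate directly into a short numpy routine. The sketch below is our own rendering: it reuses the plsa function sketched in Section 2 and a build_T helper for the transformation matrix T (a sketch of which accompanies the Appendix), and it assumes the economy SVD layout with N ≥ M.

```python
import numpy as np

def simplex_decomposition(V, Z):
    """Sketch of steps 1-10 for an (M, N) nonnegative data matrix V whose
    columns are data vectors. Relies on the plsa() and build_T() sketches."""
    M, N = V.shape
    mean_V = V.mean(axis=1, keepdims=True)
    V_hat = V - mean_V                                       # step 1: center the data
    U, s, Xt = np.linalg.svd(V_hat.T, full_matrices=False)   # step 2: V_hat^T = U S X^T
    T = build_T(M)                                           # step 3: transformation matrix
    B = (U @ T.T).T                                          # step 4: align with the simplex
    c = np.full((M, 1), 1.0 / M)                             # centroid of the standard simplex
    mean_B = B.mean(axis=1, keepdims=True)
    B_bar = B - mean_B + c                                   # step 5: recenter at the centroid
    B_hat = B_bar - B_bar.min()                              # step 6: make all entries nonnegative
    b = 1.0 - B_bar.min()                                    # step 7: rescale, as stated in the text
    B_star = B_hat / b
    W_star, H = plsa(B_star, Z)                              # step 8: PLSA on the transformed data
    W = W_star * b                                           # step 9: undo step 7
    W = W + B_bar.min()                                      #         undo step 6
    W = W + mean_B - c                                       #         undo step 5
    W_bar = np.linalg.solve(T, W).T                          #         undo step 4 (Z x M)
    W_data = (W_bar @ np.diag(s) @ Xt).T                     # step 10: undo the SVD projection
    return W_data + mean_V, H                                # ...and the centering; V ~ W' @ H'
```

The returned matrix plays the role of $W'$ in equation (3): its columns are corners of a simplex in the original data space, while the columns of $H'$ are nonnegative mixture weights that sum to 1.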
Figure 4: Result of applying our approach on the dataset
illustrated in Figure 2. As desired, the extracted basis vec-
tors form a simplex (dotted black line around the red points)
around the original datapoints instead of around the data
projections on the standard simplex.
For experiments, we created a synthetic dataset of
1000 3-dimensional points as illustrated in Figure 2.
The dataset was created in such a way that the directions of maximal variance present in the data were orthogonal to the plane of the standard simplex. Results of applying PLSA to the dataset are summarized in Figure 3, and results of applying the proposed approach are illustrated in Figure 4.
3.4 Discussion
We first point out that even though we have used
PLSA as the specific example, the proposed approach
is applicable to any topic modeling technique such as
Latent Dirichlet Allocation or Correlated Topic Mod-
els where data distributions are expressed as linear
combinations of characteristic basis distributions.
In the approach described in the previous subsections, no explicit constraints were placed on the nonnegativity of the entries of the basis vectors. So far in
this paper, we have focused on data that have nonneg-
ative entries but the proposed approach is also appli-
cable for datasets with real-valued entries. The algo-
rithm described earlier can be applied to any arbitrary
datasets with real-valued entries without any modifi-
cations. This is an alternative approach to the one
proposed by (Shashanka, 2009). In that work, data
is transformed into the next higher dimension so that
PLSA can be applied while in this work, we use SVD
to align the dataset along the dimensions of the stan-
dard simplex. It will be instructive to compare the two
approaches in this context and we leave that for future
work.
4 CONCLUSIONS
In this paper, we presented a novel approach to per-
form Simplex Decompositions on datasets. Specifi-
cally, the approach learns a set of basis vectors such
that each data vector can be expressed as a linear com-
bination of the learned set of bases and where the cor-
responding mixture weights are nonnegative and sum
to 1. PLSA performs a similar decomposition but it
characterizes the normalized datapoints instead of the
original dataset itself. We demonstrated the distorting effect such a normalization can have with the help of
a synthetic dataset. We described our approach and
demonstrated that it provides a way to overcome this
drawback. This work has several potential applica-
tions in tasks such as clustering, feature extraction,
and classification. We would like to continue this
work by applying the technique on real-world prob-
lems and demonstrating its usefulness. We also intend
to extend this work to be applicable to other related
latent variable methods such as Probabilistic Latent
Component Analysis.
REFERENCES
Blei, D. and Lafferty, J. (2006). Correlated Topic Models.
In NIPS.
Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet
Allocation. Jrnl of Machine Learning Res., 3.
Gaussier, E. and Goutte, C. (2005). Relation between PLSA
and NMF and Implications. In Proc. ACM SIGIR
Conf. on Research and Dev. in Information Retrieval,
pages 601–602.
Hofmann, T. (2001). Unsupervised Learning by Probabilis-
tic Latent Semantic Analysis. Machine Learning, 42.
Lee, D. and Seung, H. (1999). Learning the parts of objects
by non-negative matrix factorization. Nature, 401.
Lee, D. and Seung, H. (2001). Algorithms for Non-negative
Matrix Factorization. In NIPS.
Shashanka, M. (2009). Simplex Decompositions for Real-
Valued Datasets. In Proc. Intl. Workshop on Machine
Learning and Signal Processing.
Shashanka, M., Raj, B., and Smaragdis, P. (2008). Prob-
abilistic latent variable models as non-negative fac-
torizations. Computational Intelligence and Neuro-
science.
APPENDIX
In this appendix, we briefly describe how to choose the transformation matrix T that transforms M-dimensional data V such that the first (M−1) principal components lie parallel to the standard (M−1)-simplex. We need to identify a set of (M−1) M-dimensional orthonormal vectors that span the standard (M−1)-simplex.

(Shashanka, 2009) developed a procedure to find exactly such a matrix; the method is based on induction. Let $R_M$ denote an $M \times (M-1)$ matrix of (M−1) orthogonal vectors. Let $\vec{1}_M$ and $\vec{0}_M$ denote M-vectors where all the entries are 1's and 0's respectively. Similarly, let $1_{a\times b}$ and $0_{a\times b}$ denote $a \times b$ matrices of all 1's and 0's respectively. They showed that the matrix $R_{(M+1)}$ given by

$$\begin{bmatrix} R_M & \vec{1}_M \\ \vec{0}^T_{(M-1)} & -M \end{bmatrix}$$

if M is even, and

$$\begin{bmatrix} R_{(M+1)/2} & 0_{(M+1)/2 \times (M-1)/2} & \vec{1}_{(M+1)/2} \\ 0_{(M+1)/2 \times (M-1)/2} & R_{(M+1)/2} & -\vec{1}_{(M+1)/2} \end{bmatrix}$$

if M is odd, is orthogonal. $R_{(M+1)}$ is then normalized to obtain an orthonormal matrix.

Given the above relation and the fact that $R_1$ is an empty matrix, one can compute $R_M$ inductively for any value of M.

We have an additional constraint that the last principal component be orthogonal to the standard simplex, and this is easily achieved by appending a column vector of 1's to $R_M$.

Thus, the matrix T defining our desired transformation is given by $T = [R_M \;\; \vec{1}_M]$.
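A recursive numpy sketch of this construction is given below. Normalizing the appended column of 1's (so that T is orthonormal rather than merely orthogonal) is our own choice; it makes the inverse needed when undoing step 4 a simple transpose, and the function names are ours.

```python
import numpy as np

def build_R(M):
    """M x (M-1) matrix whose orthogonal columns each sum to zero and hence
    span the standard (M-1)-simplex; built by the induction described above."""
    if M == 1:
        return np.zeros((1, 0))                       # R_1 is an empty matrix
    if M % 2 == 0:                                    # M-1 odd: stack two half-sized blocks
        half = M // 2
        Rh = build_R(half)
        Z = np.zeros((half, half - 1))
        ones = np.ones((half, 1))
        return np.vstack([np.hstack([Rh, Z, ones]),
                          np.hstack([Z, Rh, -ones])])
    prev = M - 1                                      # M-1 even: extend R_{M-1} by one row/column
    Rp = build_R(prev)
    return np.vstack([np.hstack([Rp, np.ones((prev, 1))]),
                      np.hstack([np.zeros((1, prev - 1)), [[-float(prev)]]])])

def build_T(M):
    """T = [R_M, 1_M] with every column scaled to unit length."""
    T = np.hstack([build_R(M), np.ones((M, 1))])
    return T / np.linalg.norm(T, axis=0, keepdims=True)

# Quick check of the construction:
T = build_T(4)
print(np.allclose(T.T @ T, np.eye(4)))                # True: columns are orthonormal
```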