Non-negative Matrix Factorization for Binary Data
Jacob Søgaard Larsen and Line Katrine Harder Clemmensen
DTU Compute, Technical University of Denmark, Richard Petersens Plads, 2800, Lyngby, Denmark
Keywords:
Non-negative Matrix Factorization, Binary Data, Binary Matrix Factorization, Text Modelling.
Abstract:
We propose the Logistic Non-negative Matrix Factorization for the decomposition of binary data. Binary data
are frequently generated in e.g. text analysis, sensory data, market basket data etc. A common method for
analysing non-negative data is the Non-negative Matrix Factorization, though this is in theory not appropriate
for binary data, and thus we propose a novel Non-negative Matrix Factorization based on the logistic link
function. Furthermore, we generalize the method to handle missing data. The formulation of the method
is compared to a previously proposed logistic matrix factorization without non-negativity constraints on the
features. We compare the performance of the Logistic Non-negative Matrix Factorization to Least Squares
Non-negative Matrix Factorization and Kullback-Leibler (KL) Non-negative Matrix Factorization on sets of
binary data: a synthetic dataset, a set of student comments on their professors collected in a binary term-document matrix, and a sensory dataset. We find that choosing the number of components is an essential part
of the modelling and interpretation, and that this choice remains unresolved.
1 INTRODUCTION
Non-negative matrices are found in many different forms, from a general matrix with non-negative entries to the case with only binary entries. The latter is an interesting case used in many fields, e.g. text data, sensory data etc. A common tool for pre-processing data by unsupervised decomposition is the Non-negative Matrix Factorization (NMF) proposed by Lee and Seung (Lee and Seung, 1999; Lee and Seung, 2001). One issue with the general NMF is that the resulting approximation is not bounded above, and hence it is not suitable for the binary case. Zhang et al. proposed the Binary Matrix Factorization, which factorizes the binary data matrix X into two binary matrices W and H (Zhang et al., 2010). The interpretation of such a decomposition may be difficult, since the method does not estimate how important an entry in the components is, and therefore we will not consider this method for our purpose.
Gillis proposed that, when NMF is used on text data, the components are interpreted as topics (Gillis, 2014). The model also describes how important a topic is for each document and how important a term is for a topic. We adapt this approach and propose a logistic non-negative matrix factorization. Recently, Tomé et al. proposed a logistic but only partially non-negative matrix factorization, where the model allows for negative feature components (Tomé et al., 2015), whereas our method is strictly non-negative and explicitly models the threshold in the logistic sigmoid. Tomé et al. further extended their model with a Lagrangian penalty on the two-norm of the columns of W and H. Both our method and the methods by Tomé et al. use a gradient-based update scheme. Tomé et al. use a constant step length, whereas we use an adaptive scheme to ensure non-negativity; Tomé et al. ensure non-negativity by projection. In order to evaluate how well the model generalizes the data, Tomé et al. use a set-up with a test and training set, while we have generalized our method to handle missing data, thus enabling the use of cross-validation. The methods proposed by Tomé et al. are tested on synthetic data with binary basis vectors and the USPS digits, with the emphasis on how well the model reconstructs data, while we focus on estimating the correct number of feature vectors and the interpretation of the model. Furthermore, we test our model on sensory data and text data.
The training process and the selection of the model complexity are other issues regarding NMF. Nielsen and Mørup proposed to marginalize missing data in order to perform cross-validation (CV) to choose the number of components in the model (Nielsen and Mørup, 2014), and we will use this approach in the paper.
2 METHOD
Non-negative Matrix Factorization (NMF) belongs to the family of Factor Analysis methods and was proposed by Lee and Seung (Lee and Seung, 1999; Lee and Seung, 2001). As the name states, it computes a low-rank approximation of an $M \times N$ data matrix, $X$, consisting of the non-negative matrices $W \in \mathbb{R}^{M \times K}$, with entries $w_{i,d}$, and $H \in \mathbb{R}^{K \times N}$, with entries $h_{d,j}$, such that $x_{i,j} \approx \sum_{d=1}^{K} w_{i,d} h_{d,j}$. Generally the problem is formulated as in (1), where $D(\cdot,\cdot)$ is a distance measure. The cost function of interest is obtained by letting $C(W,H) = D(X, WH)$.

$$\min_{W \in \mathbb{R}^{M\times K},\, H \in \mathbb{R}^{K\times N}} D(X, WH), \quad \text{s.t. } W \geq 0,\ H \geq 0 \qquad (1)$$
In this article, three different measures are used:
Least Squares, KL-divergence and cross-entropy. The
matrices W and H are computed using Multiplicative
Updates (MU) described in Sections 2.2, 2.3 and 2.4.1.
Other methods for computing W and H are described
in (Gillis, 2014, §3.1).
The derivative of a general cost function $C(\cdot)$ with respect to an arbitrary element $w_{m,n}$ or $h_{m,n}$ of either $W$ or $H$, which for a general treatment will be termed $\beta_{m,n}$, can be decomposed into a positive and a negative part (2a). Using a gradient descent method with the step size (2b), this results in the generic update formula (2c). This formula can be shown to converge towards a non-negative solution (Lee and Seung, 2001). The problem of finding the optimal solution is NP-hard (Gillis, 2014); this means that an update procedure will converge towards a local minimum.
$$\frac{\partial C(\beta)}{\partial \beta_{m,n}} = \left[\frac{\partial C(\beta)}{\partial \beta_{m,n}}\right]^{+} - \left[\frac{\partial C(\beta)}{\partial \beta_{m,n}}\right]^{-} \qquad (2a)$$

$$\beta_{m,n} \leftarrow \beta_{m,n} - \eta_{m,n}\,\frac{\partial C(\beta)}{\partial \beta_{m,n}}, \qquad \eta_{m,n} = \frac{\beta_{m,n}}{\left[\frac{\partial C(\beta)}{\partial \beta_{m,n}}\right]^{+}} \qquad (2b)$$

$$\beta_{m,n} \leftarrow \beta_{m,n}\,\frac{\left[\frac{\partial C(\beta)}{\partial \beta_{m,n}}\right]^{-}}{\left[\frac{\partial C(\beta)}{\partial \beta_{m,n}}\right]^{+}} \qquad (2c)$$
It has been shown that NMF algorithms improve convergence speed by updating the same factor multiple times before switching to the other factor (Gillis and Glineur, 2012). In this paper each factor was updated 10 times in each iteration.
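To make the generic scheme concrete, the following is a minimal NumPy sketch (not the authors' code) of the multiplicative update (2c) with the repeated inner updates described above; the helper name `mu_update` and the Least Squares gradient splits in the example are illustrative assumptions.

```python
import numpy as np

def mu_update(B, grad_pos, grad_neg, n_inner=10, eps=1e-12):
    """Generic multiplicative update (2c): B <- B * grad_neg(B) / grad_pos(B),
    repeated n_inner times for the same factor (cf. Gillis and Glineur, 2012)."""
    for _ in range(n_inner):
        B *= grad_neg(B) / (grad_pos(B) + eps)
    return B

# Illustrative use: one outer iteration of Least Squares NMF, cf. (3b)-(3c)
rng = np.random.default_rng(0)
X = rng.random((50, 50))
K = 4
W = rng.random((X.shape[0], K))
H = rng.random((K, X.shape[1]))

W = mu_update(W, grad_pos=lambda W_: W_ @ H @ H.T, grad_neg=lambda W_: X @ H.T)
H = mu_update(H, grad_pos=lambda H_: W.T @ W @ H_, grad_neg=lambda H_: W.T @ X)
```

The positive and negative gradient parts are passed as callables so that the same loop can serve the Least Squares, KL and logistic variants derived below.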
2.1 Selecting the Number of
Components
In order to determine the optimal number of components $K$ in $W$ and $H$, Nielsen and Mørup proposed a marginalization approach for handling missing data as an alternative to Expectation Maximization (Nielsen and Mørup, 2014). The method uses an indicator matrix $R$, with entries $r_{i,j}$, that is 1 if $x_{i,j}$ is present and 0 otherwise. This enables the use of Cross Validation (CV) to estimate the generalization error of a model; the "one standard-error rule" (Hastie et al., 2009) can then be used to select the optimal number of components.
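As a sketch of how the indicator matrix $R$ and the one-standard-error rule can be combined in practice, the snippet below splits the entries of $X$ at random into folds and picks the smallest $K$ whose mean CV error is within one standard error of the minimum. The function `fit_nmf` is a placeholder for any of the marginalized variants derived in the following sections, and the squared reconstruction error on held-out entries is used purely for illustration.

```python
import numpy as np

def choose_K(X, fit_nmf, K_values, n_folds=10, seed=0):
    """Entry-wise K-fold CV with the one-standard-error rule (sketch)."""
    rng = np.random.default_rng(seed)
    fold_id = rng.integers(0, n_folds, size=X.shape)    # assign each entry to a fold
    errors = np.zeros((len(K_values), n_folds))
    for a, K in enumerate(K_values):
        for f in range(n_folds):
            R = (fold_id != f).astype(float)             # r_ij = 1 if x_ij is "present"
            W, H = fit_nmf(X, R, K)                      # marginalized fit (placeholder)
            held_out = fold_id == f
            errors[a, f] = np.mean((X - W @ H)[held_out] ** 2)
    mean = errors.mean(axis=1)
    se = errors.std(axis=1, ddof=1) / np.sqrt(n_folds)
    best = mean.argmin()
    # smallest model whose mean CV error is within one SE of the minimum
    candidates = np.flatnonzero(mean <= mean[best] + se[best])
    return K_values[candidates[0]]
```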
2.2 Least Squares
The Least Squares NMF uses the Frobenius norm as distance measure, resulting in the objective (3a). Together with the non-negativity constraints on $W$ and $H$, this constitutes the problem formulation. The multiplicative update formulas (3b) and (3c) for the Least Squares formulation are obtained by taking gradient steps on (3a) with the step size defined in (2b).
$$C_{LS} = \sum_{i,j}\left(x_{i,j} - (WH)_{i,j}\right)^2 \qquad (3a)$$

$$w_{i,d} \leftarrow w_{i,d}\,\frac{(XH^T)_{i,d}}{(WHH^T)_{i,d}} \qquad (3b)$$

$$h_{d,j} \leftarrow h_{d,j}\,\frac{(W^TX)_{d,j}}{(W^TWH)_{d,j}} \qquad (3c)$$
The marginalization approach uses the slightly modified objective function (4a), with gradients given in (4b) and (4c). Using the step size (2b), the resulting MU formulas are (4d) and (4e).
$$C^R_{LS} = \sum_{i,j} r_{i,j}\left(x_{i,j} - (WH)_{i,j}\right)^2 \qquad (4a)$$

$$\frac{\partial C^R_{LS}}{\partial w_{i,d}} = \sum_{j} r_{i,j}\left((WH)_{i,j} - x_{i,j}\right)h_{d,j} \qquad (4b)$$

$$\frac{\partial C^R_{LS}}{\partial h_{d,j}} = \sum_{i} r_{i,j}\left((WH)_{i,j} - x_{i,j}\right)w_{i,d} \qquad (4c)$$

$$w_{i,d} \leftarrow w_{i,d}\,\frac{\sum_j r_{i,j}\,x_{i,j}\,h_{d,j}}{\sum_j r_{i,j}\,(WH)_{i,j}\,h_{d,j}} \qquad (4d)$$

$$h_{d,j} \leftarrow h_{d,j}\,\frac{\sum_i r_{i,j}\,x_{i,j}\,w_{i,d}}{\sum_i r_{i,j}\,(WH)_{i,j}\,w_{i,d}} \qquad (4e)$$
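A compact NumPy sketch of the marginalized Least Squares updates (4d) and (4e), written with matrix products rather than explicit sums; the initialization, iteration counts and the small `eps` guard against division by zero are assumptions, not part of the derivation.

```python
import numpy as np

def ls_nmf_marginalized(X, R, K, n_iter=500, n_inner=10, eps=1e-12, seed=0):
    """Marginalized Least Squares NMF with multiplicative updates (4d)-(4e).
    R is the indicator matrix: 1 where x_ij is observed, 0 where it is missing."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, K))
    H = rng.random((K, N))
    RX = R * X
    for _ in range(n_iter):
        for _ in range(n_inner):                               # (4d)
            W *= (RX @ H.T) / ((R * (W @ H)) @ H.T + eps)
        for _ in range(n_inner):                               # (4e)
            H *= (W.T @ RX) / (W.T @ (R * (W @ H)) + eps)
    return W, H
```

A function with this signature can, for example, be passed as `fit_nmf` to the CV sketch in Section 2.1.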
2.3 KL-divergence
Non-negative data are poorly approximated by a normal distribution, and (Lee and Seung, 1999) therefore proposes to use the divergence (5a) instead. The term "divergence" is used instead of "distance", since the measure is not symmetric in $X$ and $WH$. This corresponds to a model in which $x_{i,j}$ has a Poisson distribution with mean $(WH)_{i,j}$ (Hastie et al., 2009, §14.6), and it has received its name since it reduces to the Kullback-Leibler divergence for $\sum_{i,j} x_{i,j} = \sum_{i,j} (WH)_{i,j} = 1$ (Lee and Seung, 2001).

The formulas for the multiplicative update scheme are given in Equations (5d) and (5e); they are derived from (5a) using the derivatives and step sizes given in (5b) and (5c).
$$C_{KL} = -\sum_{i,j}\left(x_{i,j}\log(WH)_{i,j} - (WH)_{i,j}\right) \qquad (5a)$$

$$\frac{\partial C_{KL}}{\partial w_{i,d}} = \sum_{j} h_{d,j} - \sum_{j}\frac{x_{i,j}\,h_{d,j}}{(WH)_{i,j}}, \qquad \eta_{i,d} = \frac{w_{i,d}}{\sum_j h_{d,j}} \qquad (5b)$$

$$\frac{\partial C_{KL}}{\partial h_{d,j}} = \sum_{i} w_{i,d} - \sum_{i}\frac{x_{i,j}\,w_{i,d}}{(WH)_{i,j}}, \qquad \eta_{d,j} = \frac{h_{d,j}}{\sum_i w_{i,d}} \qquad (5c)$$

$$w_{i,d} \leftarrow w_{i,d}\,\frac{\sum_j \frac{x_{i,j}}{(WH)_{i,j}}\,h_{d,j}}{\sum_j h_{d,j}} \qquad (5d)$$

$$h_{d,j} \leftarrow h_{d,j}\,\frac{\sum_i w_{i,d}\,\frac{x_{i,j}}{(WH)_{i,j}}}{\sum_i w_{i,d}} \qquad (5e)$$
Similar to the Least Squares setting, the marginal-
ization approach can also be applied to the KL-
divergence. The modified cost function is given in
(6a). The multiplicative update formulas (6d) and (6e)
are derived using the generic approach (2a)-(2c) with
derivatives given in (6b) and (6c).
$$C^R_{KL} = -\sum_{i,j} r_{i,j}\left(x_{i,j}\log(WH)_{i,j} - (WH)_{i,j}\right) \qquad (6a)$$

$$\frac{\partial C^R_{KL}}{\partial w_{i,d}} = \sum_{j} r_{i,j}\left(h_{d,j} - \frac{x_{i,j}\,h_{d,j}}{(WH)_{i,j}}\right) \qquad (6b)$$

$$\frac{\partial C^R_{KL}}{\partial h_{d,j}} = \sum_{i} r_{i,j}\left(w_{i,d} - \frac{x_{i,j}\,w_{i,d}}{(WH)_{i,j}}\right) \qquad (6c)$$

$$w_{i,d} \leftarrow w_{i,d}\,\frac{\sum_j r_{i,j}\,\frac{x_{i,j}}{(WH)_{i,j}}\,h_{d,j}}{\sum_j r_{i,j}\,h_{d,j}} \qquad (6d)$$

$$h_{d,j} \leftarrow h_{d,j}\,\frac{\sum_i r_{i,j}\,w_{i,d}\,\frac{x_{i,j}}{(WH)_{i,j}}}{\sum_i r_{i,j}\,w_{i,d}} \qquad (6e)$$
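Analogously, a sketch of the marginalized KL updates (6d) and (6e); as before, the initialization and the `eps` guard are illustrative choices.

```python
import numpy as np

def kl_nmf_marginalized(X, R, K, n_iter=500, n_inner=10, eps=1e-12, seed=0):
    """Marginalized KL NMF with multiplicative updates (6d)-(6e)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], K))
    H = rng.random((K, X.shape[1]))
    for _ in range(n_iter):
        for _ in range(n_inner):                   # (6d)
            Q = R * X / (W @ H + eps)
            W *= (Q @ H.T) / (R @ H.T + eps)
        for _ in range(n_inner):                   # (6e)
            Q = R * X / (W @ H + eps)
            H *= (W.T @ Q) / (W.T @ R + eps)
    return W, H
```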
2.4 Logistic NMF
For binary data a general NMF is not optimal, in the sense that it maps onto the entire non-negative real axis. The Logistic Non-negative Matrix Factorization is therefore proposed. The model is a generative model formed by a combination of NMF and logistic regression. The model is given in Equation (7), with $y_{i,j} = p(1\,|\,\sum_d w_{i,d}h_{d,j}, c_{i,j})$ being the probability that $x_{i,j}$ is 1 and $\sigma(\cdot)$ being the logistic sigmoid function (8a). The threshold $c_{i,j}$ is estimated in two different ways: as a global constant applicable for all $y_{i,j}$, and as a rank 1 approximation $u_i v_j$; see the descriptions below.
$$a_{i,j} = \sum_{d} w_{i,d}h_{d,j} - c_{i,j}, \qquad w_{i,d},\, h_{d,j},\, c_{i,j} \geq 0 \qquad (7a)$$

$$y_{i,j} = \sigma(a_{i,j}) = p\Big(1\,\Big|\,\sum_d w_{i,d}h_{d,j},\, c_{i,j}\Big) \qquad (7b)$$

$$p\Big(0\,\Big|\,\sum_d w_{i,d}h_{d,j},\, c_{i,j}\Big) = 1 - p\Big(1\,\Big|\,\sum_d w_{i,d}h_{d,j},\, c_{i,j}\Big) \qquad (7c)$$

$$\sigma(a) = \frac{1}{1 + \exp(-a)} \qquad (8a)$$

$$\frac{d\sigma}{da} = \sigma(1 - \sigma) \qquad (8b)$$
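In code, the forward model (7)-(8) is simply a thresholded, squashed NMF reconstruction; the sketch below assumes the threshold is supplied as a full matrix C (a constant for the global variant, an outer product for the rank 1 variant).

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid (8a)."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(W, H, C):
    """Model (7): y_ij = sigma(sum_d w_id h_dj - c_ij), the probability that x_ij = 1."""
    return sigmoid(W @ H - C)
```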
General Cost Function. The model (7) applied to a 0-1 coded matrix $X$ leads to the likelihood function (9a) and the negative log-likelihood (9b).

$$p(X\,|\,W,H,c_{i,j}) = \prod_{i,j} y_{i,j}^{x_{i,j}}\,(1 - y_{i,j})^{1 - x_{i,j}} \qquad (9a)$$

$$C_{LL}(W,H,c_{i,j}) = -\log p(X\,|\,W,H,c_{i,j}) = -\sum_{i,j}\left(x_{i,j}\log(y_{i,j}) + (1 - x_{i,j})\log(1 - y_{i,j})\right) \qquad (9b)$$
Marginalized Cost Function. In case of missing data, a generalization of (9) is given as the marginalized likelihood function (10a) and the corresponding negative log-likelihood (10b).

$$p_r(X\,|\,W,H,c_{i,j}) = \prod_{i,j}\left(y_{i,j}^{x_{i,j}}\,(1 - y_{i,j})^{1 - x_{i,j}}\right)^{r_{i,j}} \qquad (10a)$$

$$C^R_{LL}(W,H,c_{i,j}) = -\log p_r(X\,|\,W,H,c_{i,j}) = -\sum_{i,j} r_{i,j}\left(x_{i,j}\log(y_{i,j}) + (1 - x_{i,j})\log(1 - y_{i,j})\right) \qquad (10b)$$
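The marginalized negative log-likelihood (10b) translates directly into a few lines of NumPy; clipping the probabilities away from 0 and 1 is a numerical safeguard we add, not part of the formulation.

```python
import numpy as np

def neg_log_likelihood(X, R, Y, eps=1e-12):
    """Cost (10b) for binary X, indicator R and probabilities Y = sigma(WH - C)."""
    Y = np.clip(Y, eps, 1.0 - eps)   # avoid log(0); numerical safeguard only
    return -np.sum(R * (X * np.log(Y) + (1.0 - X) * np.log(1.0 - Y)))
```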
2.4.1 Determining the Parameters
In order to determine the parameters of the model, the optimization problem (11) is formulated for the marginalized negative log-likelihood as:
$$\min_{W,H,c_{i,j}} \; C^R_{LL}(W,H,c_{i,j}) \qquad (11a)$$

$$\text{subject to } W \in \mathbb{R}^{M\times K},\; W \geq 0 \qquad (11b)$$

$$H \in \mathbb{R}^{K\times N},\; H \geq 0 \qquad (11c)$$

$$c_{i,j} \in \mathbb{R},\; c_{i,j} \geq 0 \qquad (11d)$$
Global Threshold. Using a global constant $c_{i,j} = c$, the optimization problem is convex (Boyd and Vandenberghe, 2009, §7.1) separately in $W$, $H$ and $c$; henceforth this variant is known as Global thresh NMF. It is therefore solved using MU based on a gradient descent method. The partial derivatives with respect to $w_{i,d}$, $h_{d,j}$ and $c$ are given in (12). They are derived using the chain rule and the derivative of the logistic sigmoid function given in (8b).
$$\frac{\partial C^R_{LL}(W,H,c)}{\partial w_{i,d}} = \sum_{j} r_{i,j}\,(y_{i,j} - x_{i,j})\,h_{d,j} \qquad (12a)$$

$$\frac{\partial C^R_{LL}(W,H,c)}{\partial h_{d,j}} = \sum_{i} r_{i,j}\,(y_{i,j} - x_{i,j})\,w_{i,d} \qquad (12b)$$

$$\frac{\partial C^R_{LL}(W,H,c)}{\partial c} = \sum_{i,j} r_{i,j}\,(x_{i,j} - y_{i,j}) \qquad (12c)$$
Using the step size defined in (2b), this leads to the
multiplicative update formulas (13).
$$w_{i,d} \leftarrow w_{i,d}\,\frac{\sum_j r_{i,j}\,x_{i,j}\,h_{d,j}}{\sum_j r_{i,j}\,y_{i,j}\,h_{d,j}} \qquad (13a)$$

$$h_{d,j} \leftarrow h_{d,j}\,\frac{\sum_i r_{i,j}\,x_{i,j}\,w_{i,d}}{\sum_i r_{i,j}\,y_{i,j}\,w_{i,d}} \qquad (13b)$$

$$c \leftarrow c\,\frac{\sum_{i,j} r_{i,j}\,y_{i,j}}{\sum_{i,j} r_{i,j}\,x_{i,j}} \qquad (13c)$$
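Putting the pieces together, the following is a minimal sketch of Global thresh NMF using the updates (13a)-(13c); the initialization of W, H and c, the iteration counts and the `eps` guard are assumptions, not specified in the derivation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def global_thresh_nmf(X, R, K, n_iter=500, n_inner=10, eps=1e-12, seed=0):
    """Logistic NMF with a global threshold, multiplicative updates (13a)-(13c)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], K))
    H = rng.random((K, X.shape[1]))
    c = 1.0                                      # assumed starting value
    RX = R * X
    for _ in range(n_iter):
        for _ in range(n_inner):                 # (13a)
            RY = R * sigmoid(W @ H - c)
            W *= (RX @ H.T) / (RY @ H.T + eps)
        for _ in range(n_inner):                 # (13b)
            RY = R * sigmoid(W @ H - c)
            H *= (W.T @ RX) / (W.T @ RY + eps)
        RY = R * sigmoid(W @ H - c)              # (13c)
        c *= RY.sum() / (RX.sum() + eps)
    return W, H, c
```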
Rank 1 Approximation of Threshold. When introducing a rank 1 approximation of the threshold for each $(i,j)$, such that $c_{i,j} = u_i v_j$, the problem is still convex in $u_i$ and $v_j$ separately; henceforth this method is known as Max thresh NMF. Let $u_i$ be the $i$'th row of $U$ and $v_j$ the $j$'th column of $V$; the partial derivatives are then as shown in (14).
$$\frac{\partial C^R_{LL}(W,H,U,V)}{\partial u_i} = \sum_{j} r_{i,j}\,(x_{i,j} - y_{i,j})\,v_j \qquad (14a)$$

$$\frac{\partial C^R_{LL}(W,H,U,V)}{\partial v_j} = \sum_{i} r_{i,j}\,(x_{i,j} - y_{i,j})\,u_i \qquad (14b)$$
Using the step size defined in (2b), the update formu-
las are then given in (15).
$$u_i \leftarrow u_i\,\frac{\sum_j r_{i,j}\,y_{i,j}\,v_j}{\sum_j r_{i,j}\,x_{i,j}\,v_j} \qquad (15a)$$

$$v_j \leftarrow v_j\,\frac{\sum_i r_{i,j}\,y_{i,j}\,u_i}{\sum_i r_{i,j}\,x_{i,j}\,u_i} \qquad (15b)$$
2.4.2 Constraints
The update formulas introduced in (13c), (15a) and (15b) carry a risk of dividing by zero, and in the case of a sparse matrix the numerator may be much larger than the denominator. Together with (7a), this introduces the risk of both $WH$ and $c_{i,j}$ exploding in size. This is avoided by adding the constraint (16) to the optimization problem (11), with $\lambda$ being a positive constant.

$$c_{i,j} \leq \lambda \qquad (16)$$
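For the Max thresh variant, the updates (15a)-(15b) operate on the two threshold vectors. The sketch below shows one round of these updates and one simple way (our assumption, not necessarily the authors' implementation) of respecting the bound (16), namely clipping the resulting threshold matrix.

```python
import numpy as np

def update_rank1_threshold(X, R, Y, u, v, lam, eps=1e-12):
    """One round of (15a)-(15b); Y holds the current probabilities y_ij.
    Returns the updated vectors and the clipped threshold matrix C (cf. (16))."""
    RX, RY = R * X, R * Y
    u *= (RY @ v) / (RX @ v + eps)        # (15a)
    v *= (RY.T @ u) / (RX.T @ u + eps)    # (15b)
    C = np.minimum(np.outer(u, v), lam)   # enforce c_ij <= lambda by clipping
    return u, v, C
```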
3 SIMULATED DATA
3.1 Generating Simulated Data
The four NMF variants are compared on simulated
data where we know the true underlying components.
The data simulates text data, which is ensured by
putting the following conditions on W and H.
1. Columns of $W$ are sparse
2. $\sum_k W_{i,k} = 1$
3. Rows of $H$ are exponentially distributed

1.: Each component represents a topic (Gillis, 2014), but each topic is not necessarily present in all documents. 2.: The pure documents do not contain any noise and are therefore represented only by the constructed topics. 3.: Terms are exponentially distributed in natural languages (Paukkeri, 2012).
To investigate how the two Logistic NMF methods behave compared to LS NMF and KL NMF when applied to text data, three problems have been created with varying degrees of sparsity in the score vectors:

P. 1: 25 % entries in the score vectors
P. 2: 50 % entries in the score vectors
P. 3: 75 % entries in the score vectors
The data matrix $\tilde{X}$ is then generated as described in (17), with $\tau$ being a positive constant.

$$a = WH - \tau \qquad (17a)$$

$$X = \frac{1}{1 + \exp(-a)} \qquad (17b)$$

$$\tilde{X} = \mathrm{round}(X) \qquad (17c)$$
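A sketch of the data generation under conditions 1-3 and (17); the exact sparsity pattern, the unit exponential scale and the value of τ are illustrative assumptions (only the fraction of entries in the score vectors is varied between the problems).

```python
import numpy as np

def simulate_binary_text(M=50, N=50, K=4, density=0.25, tau=1.0, seed=0):
    """Simulate a binary 'term-document' matrix following (17)."""
    rng = np.random.default_rng(seed)
    W = rng.random((M, K)) * (rng.random((M, K)) < density)  # condition 1: sparse scores
    W[W.sum(axis=1) == 0, rng.integers(0, K)] = 1.0          # avoid all-zero rows
    W /= W.sum(axis=1, keepdims=True)                        # condition 2: rows sum to 1
    H = rng.exponential(scale=1.0, size=(K, N))              # condition 3
    A = W @ H - tau                                          # (17a)
    X = 1.0 / (1.0 + np.exp(-A))                             # (17b)
    return np.round(X), W, H                                 # (17c): binary data matrix
```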
3.2 Analysing Simulated Data
In all the problems, 4 topics are simulated with
columns of W and rows of H having length 50. The
resulting data matrix is therefore of size 50 × 50. The
problems are simulated 100 times and modelled by
each NMF method. For each simulation, the dataset is divided into 11 parts; the first 10 parts are used to estimate the optimal number of components by 10-fold CV, and the last part is used to report how well the model predicts unknown data.

Figure 1: Box plots of the estimated number of components for (a) Problem 1, (b) Problem 2 and (c) Problem 3. Whisker length is 1.5 IQR; the dashed line indicates the true number of components.
Analysing the Estimated Number of Components.
Figure 1 shows box plots of the estimated number of
components for each method. By performing a t-test
with significance level α = 0.05 it is concluded that
none of the methods estimate the correct number of
components (4). By using a Welch t-test at significance level α = 0.05 it is tested whether the methods estimate different numbers of components. Instances where the test is not significant are written in bold face in Table 1.
Table 1: P-values for problems and methods where the methods estimate the same number of components.

                       P. 1     P. 2        P. 3
LS vs. Glob. thresh    0.19     1.2·10⁻⁶    < 10⁻¹⁰
KL vs. Max thresh      0.087    2.7·10⁻⁸    < 10⁻¹⁰
Glob. vs. Max          0.22     0.23        0.62
Analysing the Error Level. The Least Squares error (3a), the KL-divergence (5a) and the cross-entropy (9b) are not directly comparable; hence only the two Logistic NMF methods are compared in terms of prediction error. Figure 2 shows box plots of the mean cross-entropy of prediction when using the optimal number of components. A Welch t-test at significance level α = 0.05 reveals that for all of the problems the prediction error is considered the same for the two methods. The p-values are 0.19, 0.12 and 0.08 for Problems 1, 2 and 3, respectively.

Figure 2: Comparison of the mean cross-entropy of prediction for Max thresh NMF and Global thresh NMF with the optimal number of components, for (a) Problem 1, (b) Problem 2 and (c) Problem 3. Whisker length is 1.5 IQR.
Figure 3: Comparison of the mean cross-entropy of prediction for Max thresh NMF and Global thresh NMF with 4-component models, for (a) Problem 1, (b) Problem 2 and (c) Problem 3. Whisker length is 1.5 IQR.
Figure 3 shows box plots of the mean cross-entropy of prediction when 4 components are used for both Global thresh NMF and Max thresh NMF. A Welch t-test with significance level α = 0.05 is performed to test whether the error levels are significantly different. The test value for all three problems is p < 10⁻¹⁰, revealing that for all three problems Global thresh NMF has a lower prediction error than Max thresh NMF.
4 REAL DATA
Two different real datasets are analysed using the four
NMF variants: A dataset consisting of student com-
ments and a sensory dataset (Randall, 1989).
4.1 Text Data
A collection of 10579 student comments on professors at various American universities, collected at www.ourumd.com, is analysed. After pre-processing (including correcting misspellings, stemming and replacing numbers and names with relevant tags) the collection holds 9400 different words. The resulting Term Document Matrix (TDM) $X$ is therefore of size 10579 × 9400. The estimated models are analysed using the procedure described in (Gillis, 2014). The interpretations of $W$ and $H$ are swapped, since the $i$'th document is collected in the $i$'th row of $X$.
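For reference, a binary term-document matrix of this kind can be built with scikit-learn's CountVectorizer in binary mode; the two example comments below are made up, and the real pre-processing steps (spell correction, stemming, tagging) are assumed to have been applied beforehand.

```python
from sklearn.feature_extraction.text import CountVectorizer

comments = ["great lecturer and fair exams",
            "too much homework and a hard final exam"]
vectorizer = CountVectorizer(binary=True)   # 1 if a term occurs in a document, else 0
X = vectorizer.fit_transform(comments)      # sparse binary document-term matrix
print(X.shape, vectorizer.get_feature_names_out()[:5])
```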
The ability of each method to generalize the data is estimated using 5-fold CV. In order to avoid local minima, each fold is re-estimated 10 times. Figure 4 shows the estimated generalization error and the chosen model complexity for each method. The dominant words of each topic are shown in Tables 2-5.

Figure 4: CV error for text data: (a) LS NMF, (b) KL NMF, (c) Global thresh NMF, (d) Max thresh NMF. (a) and (b): mean CV error with standard error bars. (c) and (d): mean CV error minus one standard error. Circles and crosses mark the optimal model complexity with regard to the "one standard-error rule" (Hastie et al., 2009).

Table 2: Most dominant words for LS NMF.
Component 1: exam, lectur, num, grade, question, final, test, class, help, name, book, studi, good, materi, point, homework, hard, problem, note, read
Component 2: num, class, name, professor, cours, interest, great, !, student, teacher, easi, lot, recommend, teach, learn, good, paper, materi, read, make

Table 3: Most dominant words for KL NMF.
Component 1: num, class, name, professor, cours, lectur, exam, grade, good, easi, lot, student, help, materi, test, read, teacher, teach, !, question

Table 4: Most dominant words for Global thresh NMF.
Component 1: num, lectur, exam, class, question, studi, easi, onlin, slide, answer, post, attent, test, note, choic, pai, multipl, hard, name, attend
Component 2: num, class, name, cours, professor, exam, good, materi, help, lectur, cover, start, student, prepar, level, come, found, teach, will, school
Component 3: num, class, name, particip, essai, group, present, grade, easi, fun, paper, interest, core, read, project, !, short, page, midterm, requir

Table 5: Most dominant words for Max thresh NMF.
Component 1: offic, ve, look, complet, tell, averag, feel, major, hour, practic, high, sai, ask, school, minut, num, show, mean, gener, give
Component 2: multipl, post, answer, question, choic, final, slide, onlin, concept, attend, class, num, midterm, grade, point, requir, exam, semest, studi, consist
4.2 Wine Data
The four variants of NMF are applied to the 'Bitterness of Wine' dataset (Randall, 1989). The data is represented by introducing the 7 binary variables listed below (a minimal encoding sketch follows the list):
Contact (1 if ”yes”, 0 if ”no”)
Temperature (1 if ”warm”, 0 if ”cold”)
Rating 1
Rating 2
Rating 3
Rating 4
Rating 5
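As a small illustration of this encoding (the variable order and the helper name are our own, not from the original analysis), the 7 binary columns can be built as follows:

```python
import numpy as np

def encode_wine(contact, temperature, rating):
    """Encode contact/temperature as 0-1 flags and the 1-5 rating as indicators."""
    rating = np.asarray(rating)
    X = np.zeros((len(rating), 7))
    X[:, 0] = contact                              # Contact: 1 = "yes", 0 = "no"
    X[:, 1] = temperature                          # Temperature: 1 = "warm", 0 = "cold"
    X[np.arange(len(rating)), 1 + rating] = 1.0    # Rating r -> column 1 + r (Rating 1..5)
    return X
```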
The generalization error of each of the models is estimated using 10-fold cross validation. In order to avoid local minima in the training process, each fold is trained 30 times and the model with the lowest validation error is reported. The model complexity is chosen using the "one standard-error rule" (Hastie et al., 2009). The estimated generalization errors and chosen model complexities are shown in Figure 5.

Figure 5: CV error for wine data: (a) LS NMF, (b) KL NMF, (c) Global thresh NMF, (d) Max thresh NMF. (a) and (b): mean CV error with standard error bars. (c) and (d): mean CV error minus one standard error. Circles and crosses mark the optimal model complexity with regard to the "one standard-error rule" (Hastie et al., 2009).
Figure 6 shows the components of the estimated 6-component model using Global thresh NMF. From the rows of $H$ it is seen that the components basically describe one variable each, except for Contact, which is described by both components 5 and 6; further, component 4 describes both Temperature and Rating 5.

Figure 6: Visualization of the components of the estimated 6-component model using Global thresh NMF: (a) columns of $W$ plotted against each other, (b) bar plots of the rows of $H$.
5 DISCUSSION
5.1 Simulated Data
On the simulated problems we saw that the assumptions regarding the distribution of the noise, and thereby the choice of method, influence how many components are estimated as being optimal. We observed that all four methods have a tendency to estimate too many components as being optimal, even though the "one standard-error rule" was used to choose the number of components in the models.
When changing the number of components, the interpretation of a model may change, as the distribution of signal among the components changes. Thus, the use of extra components compared to the true number of components (4) may hinder the interpretation of the estimated model. A low generalization error is as important as estimating the true number of components when building a model. Furthermore, it is observed that the size of the error in estimating the number of components is problem dependent.

The error levels were compared for Max thresh NMF and Global thresh NMF. It was seen that when the optimal number of components was used, the two methods performed with the same error level. When the true number of components was used, Global thresh NMF performed significantly better than Max thresh NMF.
5.2 Text Data
The components or topics extracted from the collection of student comments are clearly related to teaching and exams. The logistic NMFs are able to extract at least as many components from the data as LS NMF and more components than KL NMF. This indicates that KL NMF is not as well suited for binary data as the other methods.
5.3 Wine Data
When estimating the optimal number of components, it was observed that the result depends on the method. LS NMF, KL NMF and Max thresh NMF estimated that 1-2 components should be used, while Global thresh NMF estimated that 6 components were optimal. Further, LS NMF and KL NMF were seen to have large standard errors.

The 6-component model determined with Global thresh NMF was presented. It was seen that each variable was, to some degree, described by a component. Furthermore, it was observed that the variable Rating 1 was not described by any of the components. The variable Rating 1 is 1 in 5 samples while being 0 in 67 samples, i.e. it is 1 in approx. 7% of the samples, which may have influenced the model.
6 CONCLUSION
We presented two well-known variants of Non-negative Matrix Factorization and proposed the logistic Non-negative Matrix Factorization for binary data. The proposed method was presented in two variants: one with a global threshold for the logistic sigmoid function and one with a rank one approximation (max threshold).
The four NMF methods were applied to a collection of student comments regarding professors at various universities. It was seen that all methods were able to extract components describing teaching and exams. Global thresh NMF and Max thresh NMF were able to validate more components than the two usual methods, LS NMF and KL NMF.
The "Bitterness of Wine" dataset (Randall, 1989) was analysed using all four NMF methods. It was seen that the methods which had a large standard error when doing cross validation either estimated few components (LS, KL and Max) or many components (Global).
The four NMF methods were also compared on
simulated data with four underlying components. The
methods had good error convergence. However, the
methods had a tendency to choose too many compo-
nents. Furthermore, it was seen that the bias in esti-
mating the number of components was problem de-
pendent.
The interpretation may change with a varying number of components, and as this is where we observed the most issues across all four studies, we recommend that future work investigate methods to estimate the number of components.
ACKNOWLEDGEMENTS
The authors would like to thank Jacob Kogan from the Department of Mathematics and Statistics at the University of Maryland for collecting the student comments from www.ourumd.com.
REFERENCES
Boyd, S. and Vandenberghe, L. (2009). Convex Optimiza-
tion. Cambridge University Press.
Gillis, N. (2014). The why and how of nonnegative matrix
factorization. ArXiv e-prints.
Gillis, N. and Glineur, F. (2012). Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Computation, 24(4):1085–1105.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The
Elements of Statistical Learning - Data Mining, Infer-
ence, and Prediction. Springer, 2nd edition.
Lee, D. D. and Seung, H. S. (1999). Learning the parts of
objects by non-negative matrix factorization. Nature,
401(6755):788–791.
Lee, D. D. and Seung, H. S. (2001). Algorithms for non-
negative matrix factorization. Advances in Neural In-
formation Processing Systems, 13(13):556–562.
Nielsen, S. F. V. and Mørup, M. (2014). Non-negative ten-
sor factorization with missing data for the modeling
of gene expressions in the human brain. In 2014 IEEE
International workshop on Machine Learning for Sig-
nal Processing.
Paukkeri, M.-S. (2012). Language- and domain-independent text mining. Doctoral Dissertations, Aalto University.
Randall, J. (1989). The analysis of sensory data by generalised linear model. Biometrical Journal, 7:781–793.
Tomé, A. M., Schachtner, R., Vigneron, V., Puntonet, C. G., and Lang, E. W. (2015). A logistic non-negative matrix factorisation approach to binary data sets. Multidim Syst Sign Process, 26:125–143.
Zhang, Z., Li, T., Ding, C., and Zhang, X. (2010). Binary
matrix factorization with applications. Data Mining
and Knowledge Discovery, 20(1):28–52.