Non-negative Matrix Factorization for Binary Data

Jacob Søgaard Larsen and Line Katrine Harder Clemmensen

DTU Compute, Technical University of Denmark, Richard Petersens Plads, 2800, Lyngby, Denmark

Keywords:

Non-negative Matrix Factorization, Binary Data, Binary Matrix Factorization, Text Modelling.

Abstract:

We propose the Logistic Non-negative Matrix Factorization for decomposition of binary data. Binary data are frequently generated in, e.g., text analysis, sensory data and market basket data. A common method for analysing non-negative data is the Non-negative Matrix Factorization, though this is in theory not appropriate for binary data, and thus we propose a novel Non-negative Matrix Factorization based on the logistic link function. Furthermore, we generalize the method to handle missing data. The formulation of the method is compared to a previously proposed logistic matrix factorization without a non-negativity constraint on the features. We compare the performance of the Logistic Non-negative Matrix Factorization to Least Squares Non-negative Matrix Factorization and Kullback-Leibler (KL) Non-negative Matrix Factorization on sets of binary data: a synthetic dataset, a set of student comments on their professors collected in a binary term-document matrix, and a sensory dataset. We find that choosing the number of components is an essential part of the modelling and interpretation, and that it is still unresolved.

1 INTRODUCTION

Non-negative matrices are found in many different forms, from a general matrix with non-negative entries to the case with only binary entries. The latter is an interesting case that arises in many fields, e.g. text data and sensory data. A common tool for pre-processing data by unsupervised decomposition is the Non-negative Matrix Factorization (NMF) proposed by Lee and Seung (Lee and Seung, 1999; Lee and Seung, 2001). One issue with the general NMF is that the resulting approximation is not bounded above, and hence not suitable for the binary case. Zhang et al. proposed the Binary Matrix Factorization, which factorizes the binary data matrix X into two binary matrices W and H (Zhang et al., 2010). The interpretation of such a decomposition may be difficult, since the method does not estimate how important an entry in a component is, and therefore we will not consider this method for our purpose. Gillis proposed that when NMF is used on text data, the components are interpreted as topics (Gillis, 2014). The model also describes how important a topic is for each document and how important a term is for a topic. We adopt this approach and propose a logistic non-negative matrix factorization. Recently, Tomé et al. proposed a logistic but only partially non-negative matrix factorization, where the model allows for negative feature components (Tomé et al., 2015), whereas our method is strictly non-negative and explicitly models the threshold in the logistic sigmoid. Tomé et al. further extended the model with a Lagrangian penalty on the two-norm of the columns of W and H. Both our method and the methods by Tomé et al. use a gradient-based update scheme. Tomé et al. use a constant step length and ensure non-negativity by projection, whereas we use an adaptive step length to ensure non-negativity. In order to evaluate how well the model generalizes, Tomé et al. use a set-up with a training set and a test set, while we have generalized our method to handle missing data, thus enabling the use of cross-validation. The methods proposed by Tomé et al. are tested on synthetic data with binary basis vectors and on the USPS digits, with the emphasis on how well the model reconstructs the data, while we focus on estimating the correct number of feature vectors and on the interpretation of the model. Furthermore, we test our model on sensory data and text data.

The training process and the selection of the model complexity are another issue regarding NMF. Nielsen and Mørup proposed to marginalize missing data in order to perform cross-validation (CV) to choose the number of components in the model (Nielsen and Mørup, 2014), and we will use this approach in this paper.

Larsen, J. and Clemmensen, L. Non-negative Matrix Factorization for Binary Data. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2015) - Volume 1: KDIR, pages 555-563. ISBN: 978-989-758-158-8

2 METHOD

Non-negative Matrix Factorization (NMF) belongs to the family of Factor Analysis methods and was proposed by Lee and Seung (Lee and Seung, 1999; Lee and Seung, 2001). As the name states, it computes a low-rank approximation of an $M \times N$ data matrix $X$ as the product of the non-negative matrices $W \in \mathbb{R}^{M \times K}$, with entries $w_{i,d}$, and $H \in \mathbb{R}^{K \times N}$, with entries $h_{d,j}$, such that $x_{i,j} \approx \sum_{d=1}^{K} w_{i,d} h_{d,j}$. Generally the problem is formulated as in (1), where $D(\cdot,\cdot)$ is a distance measure. The cost function of interest is obtained by letting $C(W,H) = D(X,WH)$.

$$\min_{W \in \mathbb{R}^{M \times K},\; H \in \mathbb{R}^{K \times N}} D(X, WH), \quad \text{s.t. } W \geq 0,\; H \geq 0 \tag{1}$$

In this article, three different measures are used: Least Squares, KL-divergence and cross-entropy. The matrices W and H are computed using Multiplicative Updates (MU), described in Sections 2.2, 2.3 and 2.4.1. Other methods for computing W and H are described in (Gillis, 2014, §3.1).

The derivative of a general cost function $C(\cdot)$ with respect to an arbitrary element $w_{m,n}$ or $h_{m,n}$ of either $W$ or $H$, which for generality will be termed $\beta_{m,n}$, can be decomposed into a positive and a negative part (2a). Using a gradient descent method with the step size (2b), this results in the generic update formula (2c). This formula can be shown to converge towards a non-negative solution (Lee and Seung, 2001). The problem of finding the optimal solution is NP-hard (Gillis, 2014), which means that an update procedure can in general only be expected to converge towards a local minimum.

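$$\frac{\partial C(\beta)}{\partial \beta_{m,n}} = \left[\frac{\partial C(\beta)}{\partial \beta_{m,n}}\right]^{+} - \left[\frac{\partial C(\beta)}{\partial \beta_{m,n}}\right]^{-} \tag{2a}$$

$$\beta_{m,n} \leftarrow \beta_{m,n} - \eta_{m,n} \frac{\partial C(\beta)}{\partial \beta_{m,n}}, \qquad \eta_{m,n} = \frac{\beta_{m,n}}{\left[\frac{\partial C(\beta)}{\partial \beta_{m,n}}\right]^{+}} \tag{2b}$$

$$\beta_{m,n} \leftarrow \beta_{m,n} \frac{\left[\frac{\partial C(\beta)}{\partial \beta_{m,n}}\right]^{-}}{\left[\frac{\partial C(\beta)}{\partial \beta_{m,n}}\right]^{+}} \tag{2c}$$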
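The step from (2b) to (2c) can be checked numerically. The sketch below is illustrative only: the function names, the random test values and the small constant `eps` (added to avoid division by zero) are assumptions and not part of the method description.

```python
import numpy as np

def multiplicative_step(beta, grad_pos, grad_neg, eps=1e-12):
    """Update of the form (2c): beta * [grad]^- / [grad]^+, element-wise."""
    return beta * grad_neg / (grad_pos + eps)

def gradient_step(beta, grad_pos, grad_neg, eps=1e-12):
    """The same update written as in (2b): a gradient step with eta = beta / [grad]^+."""
    eta = beta / (grad_pos + eps)
    return beta - eta * (grad_pos - grad_neg)

# Arbitrary non-negative test values (hypothetical, for illustration only).
rng = np.random.default_rng(0)
beta = rng.random((3, 2))
grad_pos = rng.random((3, 2)) + 0.1
grad_neg = rng.random((3, 2)) + 0.1

mu = multiplicative_step(beta, grad_pos, grad_neg)
gd = gradient_step(beta, grad_pos, grad_neg)
```

Both forms agree up to floating-point error, and a non-negative `beta` stays non-negative because it is only ever multiplied by a ratio of non-negative terms.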

It has been shown that the convergence speed of NMF algorithms improves when the same factor is updated multiple times before the next factor is updated (Gillis and Glineur, 2012). In this paper each factor was updated 10 times in each iteration.

2.1 Selecting the Number of Components

In order to determine the optimal number of components $K$ in $W$ and $H$, we use the marginalization approach for handling missing data that Nielsen and Mørup proposed as an alternative to Expectation Maximization (Nielsen and Mørup, 2014). The method uses an indicator matrix $R$ with entries $r_{i,j}$ that are 1 if $x_{i,j}$ is present and 0 otherwise. This enables the use of Cross Validation (CV) to estimate the generalization error of a model; the "one standard-error rule" (Hastie et al., 2009) can then be used to select the optimal number of components.
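As a rough illustration of this procedure, the sketch below builds indicator matrices for entry-wise K-fold CV and applies the one standard-error rule to a table of CV errors. The function names, the random assignment of entries to folds, and the assumption that the candidate numbers of components are sorted in increasing order are all illustrative choices, not details taken from the paper.

```python
import numpy as np

def cv_masks(shape, n_folds, rng):
    """Yield (R_train, R_test) indicator matrices; every entry is held out exactly once."""
    n_entries = int(np.prod(shape))
    fold_of_entry = (rng.permutation(n_entries) % n_folds).reshape(shape)
    for k in range(n_folds):
        R_test = (fold_of_entry == k).astype(float)
        yield 1.0 - R_test, R_test

def one_standard_error_rule(n_components, cv_errors):
    """Pick the smallest model whose mean CV error is within one standard
    error of the best mean error. cv_errors has shape (n_models, n_folds)
    and n_components is assumed to be sorted in increasing order."""
    cv_errors = np.asarray(cv_errors, dtype=float)
    mean = cv_errors.mean(axis=1)
    se = cv_errors.std(axis=1, ddof=1) / np.sqrt(cv_errors.shape[1])
    best = np.argmin(mean)
    eligible = np.flatnonzero(mean <= mean[best] + se[best])
    return n_components[eligible[0]]
```

In use, a model would be fitted with `R_train` as the indicator matrix and its error evaluated on the entries marked by `R_test`, once per fold and per candidate number of components.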

2.2 Least Squares

The Least Squares NMF uses the Frobenius norm as distance measure, resulting in the objective (3a). Together with the non-negativity constraints on $W$ and $H$ this constitutes the problem formulation. The multiplicative update formulas for the Least Squares formulation of the Non-negative Matrix Factorization, (3b) and (3c), are based on taking gradient steps on (3a) with step sizes as defined in (2b).

$$C_{\mathrm{LS}}(W,H) = \sum_{i,j} \left(x_{i,j} - (WH)_{i,j}\right)^{2} \tag{3a}$$

$$w_{i,d} \leftarrow w_{i,d} \frac{(XH^{\top})_{i,d}}{(WHH^{\top})_{i,d}} \tag{3b}$$

$$h_{d,j} \leftarrow h_{d,j} \frac{(W^{\top}X)_{d,j}}{(W^{\top}WH)_{d,j}} \tag{3c}$$

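The marginalization approach uses the slightly modified objective function (4a), whose gradients are given in (4b) and (4c). With the step size (2b), the resulting MU formulas are (4d) and (4e).

$$C^{R}_{\mathrm{LS}}(W,H) = \sum_{i,j} r_{i,j} \left(x_{i,j} - (WH)_{i,j}\right)^{2} \tag{4a}$$

$$\nabla_{w_{i,d}} C^{R}_{\mathrm{LS}} = 2 \sum_{j} r_{i,j} \left((WH)_{i,j} - x_{i,j}\right) h_{d,j} \tag{4b}$$

$$\nabla_{h_{d,j}} C^{R}_{\mathrm{LS}} = 2 \sum_{i} r_{i,j} \left((WH)_{i,j} - x_{i,j}\right) w_{i,d} \tag{4c}$$

$$w_{i,d} \leftarrow w_{i,d} \frac{\sum_{j} r_{i,j} x_{i,j} h_{d,j}}{\sum_{j} r_{i,j} (WH)_{i,j} h_{d,j}} \tag{4d}$$

$$h_{d,j} \leftarrow h_{d,j} \frac{\sum_{i} r_{i,j} x_{i,j} w_{i,d}}{\sum_{i} r_{i,j} (WH)_{i,j} w_{i,d}} \tag{4e}$$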
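A minimal NumPy sketch of the marginalized Least Squares updates (4d) and (4e) is given below. The function name, the random initialisation, the iteration count and the constant `eps` in the denominators are assumptions made for illustration; unlike the set-up described above, each factor is updated only once per iteration.

```python
import numpy as np

def ls_nmf_masked(X, R, K, n_iter=200, eps=1e-12, seed=0):
    """Least Squares NMF where only entries with R == 1 contribute to the cost."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, K)) + 0.1           # strictly positive start (assumption)
    H = rng.random((K, N)) + 0.1
    RX = R * X
    losses = []
    for _ in range(n_iter):
        # (4d): sum_j r x h / sum_j r (WH) h
        W *= (RX @ H.T) / ((R * (W @ H)) @ H.T + eps)
        # (4e): sum_i r x w / sum_i r (WH) w
        H *= (W.T @ RX) / (W.T @ (R * (W @ H)) + eps)
        # (4a): squared error over the observed entries only
        losses.append(np.sum(R * (X - W @ H) ** 2))
    return W, H, losses
```

Setting `R` to a matrix of ones recovers the ordinary updates (3b) and (3c).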

2.3 KL-divergence

Non-negative data are poorly approximated by a normal distribution, and Lee and Seung (Lee and Seung, 1999) therefore propose to use the divergence (5a) instead. The term 'divergence' is used instead of distance, since the measure is not symmetric in $X$ and $WH$. This corresponds to a model in which $x_{i,j}$ has a Poisson distribution with mean $(WH)_{i,j}$ (Hastie et al., 2009, §14.6) and has received

SSTM 2015 - Special Session on Text Mining


its name since it reduces to the Kullback-Leibler divergence for $\sum_{i,j} x_{i,j} = \sum_{i,j} (WH)_{i,j} = 1$ (Lee and Seung, 2001).

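The formulas for the multiplicative update scheme are given in (5d) and (5e). They are derived from (5a) using the derivatives and step sizes given in (5b) and (5c).

$$C_{\mathrm{KL}}(W,H) = -\sum_{i,j} \left( x_{i,j} \log (WH)_{i,j} - (WH)_{i,j} \right) \tag{5a}$$

$$\nabla_{w_{i,d}} C_{\mathrm{KL}} = \sum_{j} h_{d,j} - \sum_{j} \frac{x_{i,j} h_{d,j}}{(WH)_{i,j}}, \qquad \eta_{i,d} = \frac{w_{i,d}}{\sum_{j} h_{d,j}} \tag{5b}$$

$$\nabla_{h_{d,j}} C_{\mathrm{KL}} = \sum_{i} w_{i,d} - \sum_{i} \frac{x_{i,j} w_{i,d}}{(WH)_{i,j}}, \qquad \eta_{d,j} = \frac{h_{d,j}}{\sum_{i} w_{i,d}} \tag{5c}$$

$$w_{i,d} \leftarrow w_{i,d} \frac{\sum_{j} \frac{x_{i,j}}{(WH)_{i,j}} h_{d,j}}{\sum_{j} h_{d,j}} \tag{5d}$$

$$h_{d,j} \leftarrow h_{d,j} \frac{\sum_{i} w_{i,d} \frac{x_{i,j}}{(WH)_{i,j}}}{\sum_{i} w_{i,d}} \tag{5e}$$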

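Similar to the Least Squares setting, the marginalization approach can also be applied to the KL-divergence. The modified cost function is given in (6a). The multiplicative update formulas (6d) and (6e) are derived using the generic approach (2a)-(2c) with derivatives given in (6b) and (6c).

$$C^{R}_{\mathrm{KL}}(W,H) = -\sum_{i,j} r_{i,j} \left( x_{i,j} \log (WH)_{i,j} - (WH)_{i,j} \right) \tag{6a}$$

$$\nabla_{w_{i,d}} C^{R}_{\mathrm{KL}} = \sum_{j} r_{i,j} \left[ h_{d,j} - \frac{x_{i,j} h_{d,j}}{(WH)_{i,j}} \right] \tag{6b}$$

$$\nabla_{h_{d,j}} C^{R}_{\mathrm{KL}} = \sum_{i} r_{i,j} \left[ w_{i,d} - \frac{x_{i,j} w_{i,d}}{(WH)_{i,j}} \right] \tag{6c}$$

$$w_{i,d} \leftarrow w_{i,d} \frac{\sum_{j} r_{i,j} \frac{x_{i,j}}{(WH)_{i,j}} h_{d,j}}{\sum_{j} r_{i,j} h_{d,j}} \tag{6d}$$

$$h_{d,j} \leftarrow h_{d,j} \frac{\sum_{i} r_{i,j} w_{i,d} \frac{x_{i,j}}{(WH)_{i,j}}}{\sum_{i} r_{i,j} w_{i,d}} \tag{6e}$$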
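The marginalized KL updates (6d) and (6e) can be sketched in the same way as the Least Squares case. As before, the function name, initialisation, iteration count and the constant `eps` (which guards both the divisions and the logarithm) are illustrative assumptions.

```python
import numpy as np

def kl_nmf_masked(X, R, K, n_iter=200, eps=1e-12, seed=0):
    """KL-divergence NMF where only entries with R == 1 contribute to the cost."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, K)) + 0.1           # strictly positive start (assumption)
    H = rng.random((K, N)) + 0.1
    costs = []
    for _ in range(n_iter):
        # (6d): sum_j r (x / WH) h / sum_j r h
        ratio = R * X / (W @ H + eps)
        W *= (ratio @ H.T) / (R @ H.T + eps)
        # (6e): sum_i r w (x / WH) / sum_i r w
        ratio = R * X / (W @ H + eps)
        H *= (W.T @ ratio) / (W.T @ R + eps)
        # (6a): cost over the observed entries only
        WH = W @ H
        costs.append(-np.sum(R * (X * np.log(WH + eps) - WH)))
    return W, H, costs
```

With `R` equal to a matrix of ones, the denominators reduce to the row sums of `H` and the column sums of `W`, which gives (5d) and (5e).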

2.4 Logistic NMF

For binary data a general NMF is not optimal in the sense that its approximation can take any value on the non-negative real line rather than being restricted to the interval $[0, 1]$. The Logistic Non-negative Matrix Factorization is therefore proposed. The model is a generative model formed by a combination of NMF and logistic regression. The model is given in (7), with $y_{i,j} = p(1 \mid \sum_{d} w_{i,d} h_{d,j}, c_{i,j})$ being the probability of $x_{i,j}$ being 1 and $\sigma(\cdot)$ being the logistic sigmoid function (8a). The threshold $c_{i,j}$ is estimated in two different ways: as a global constant applicable to all $y_{i,j}$, and as a rank 1 approximation $u_{i} v_{j}$. See the description below.

$$a_{i,j} = \sum_{d} w_{i,d} h_{d,j} - c_{i,j}, \qquad w_{i,d},\, h_{d,j},\, c_{i,j} \geq 0 \tag{7a}$$

$$y_{i,j} = \sigma(a_{i,j}) = p\Big(1 \,\Big|\, \sum_{d} w_{i,d} h_{d,j},\, c_{i,j}\Big) \tag{7b}$$

$$p\Big(0 \,\Big|\, \sum_{d} w_{i,d} h_{d,j},\, c_{i,j}\Big) = 1 - p\Big(1 \,\Big|\, \sum_{d} w_{i,d} h_{d,j},\, c_{i,j}\Big) \tag{7c}$$

$$\sigma(a) = \frac{1}{1 + \exp(-a)} \tag{8a}$$

$$\frac{d\sigma}{da} = \sigma (1 - \sigma) \tag{8b}$$

General Cost Function. The model (7) applied to a 0-1 coded matrix $X$ leads to the likelihood function (9a) and the negative log-likelihood (9b).

$$p(X \mid W, H, c_{i,j}) = \prod_{i,j} y_{i,j}^{x_{i,j}} (1 - y_{i,j})^{1 - x_{i,j}} \tag{9a}$$

$$C(W, H, c_{i,j}) = -\log p(X \mid W, H, c_{i,j}) = -\sum_{i,j} \left( x_{i,j} \log(y_{i,j}) + (1 - x_{i,j}) \log(1 - y_{i,j}) \right) \tag{9b}$$

Marginalized Cost Function. In case of missing data, a generalization of (9) is given as the marginalized likelihood function in (10a) and the corresponding negative log-likelihood (10b).

$$p^{R}(X \mid W, H, c_{i,j}) = \prod_{i,j} \left[ y_{i,j}^{x_{i,j}} (1 - y_{i,j})^{1 - x_{i,j}} \right]^{r_{i,j}} \tag{10a}$$

$$C^{R}(W, H, c_{i,j}) = -\log p^{R}(X \mid W, H, c_{i,j}) = -\sum_{i,j} r_{i,j} \left( x_{i,j} \log(y_{i,j}) + (1 - x_{i,j}) \log(1 - y_{i,j}) \right) \tag{10b}$$

2.4.1 Determining the Parameters

In order to determine the parameters of the model, the optimization problem (11) is formulated for the marginalized negative log-likelihood as:

$$\min_{W, H, c_{i,j}} C^{R}(W, H, c_{i,j}) \tag{11a}$$

$$\text{subject to } W \in \mathbb{R}^{M \times K},\; W \geq 0 \tag{11b}$$

$$H \in \mathbb{R}^{K \times N},\; H \geq 0 \tag{11c}$$

$$c_{i,j} \in \mathbb{R},\; c_{i,j} \geq 0 \tag{11d}$$


Global Threshold. Using a global constant $c_{i,j} = c$, the optimization problem is convex (Boyd and Vandenberghe, 2009, §7.1) separately in $W$, $H$ and $c$; henceforth this variant is known as Global thresh NMF. It is therefore solved using MU based on a gradient descent method. The partial derivatives with respect to $w_{i,d}$, $h_{d,j}$ and $c$ are given in (12). They are derived using the chain rule and the derivative of the logistic sigmoid function given in (8b).

$$\frac{\partial C^{R}(W, H, c)}{\partial w_{i,d}} = \sum_{j} r_{i,j} (y_{i,j} - x_{i,j}) h_{d,j} \tag{12a}$$

$$\frac{\partial C^{R}(W, H, c)}{\partial h_{d,j}} = \sum_{i} r_{i,j} (y_{i,j} - x_{i,j}) w_{i,d} \tag{12b}$$

$$\frac{\partial C^{R}(W, H, c)}{\partial c} = \sum_{i,j} r_{i,j} (x_{i,j} - y_{i,j}) \tag{12c}$$
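The chain-rule step behind (12a) can be written out as follows. This is a worked sketch based on (7), (8b) and (10b), with $C^{R}$ denoting the marginalized negative log-likelihood; it is not quoted from the source.

```latex
\begin{align*}
\frac{\partial C^{R}}{\partial y_{i,j}}
  &= -r_{i,j}\left(\frac{x_{i,j}}{y_{i,j}} - \frac{1 - x_{i,j}}{1 - y_{i,j}}\right)
   = r_{i,j}\,\frac{y_{i,j} - x_{i,j}}{y_{i,j}(1 - y_{i,j})}, \\
\frac{\partial y_{i,j}}{\partial a_{i,j}}
  &= y_{i,j}(1 - y_{i,j}), \qquad
\frac{\partial a_{i,j}}{\partial w_{i,d}} = h_{d,j}, \\
\frac{\partial C^{R}}{\partial w_{i,d}}
  &= \sum_{j} \frac{\partial C^{R}}{\partial y_{i,j}}
     \frac{\partial y_{i,j}}{\partial a_{i,j}}
     \frac{\partial a_{i,j}}{\partial w_{i,d}}
   = \sum_{j} r_{i,j}\,(y_{i,j} - x_{i,j})\,h_{d,j}.
\end{align*}
```

The factor $y_{i,j}(1 - y_{i,j})$ cancels, which is why the gradients take such a simple form. Replacing the last factor by $\partial a_{i,j} / \partial h_{d,j} = w_{i,d}$ or $\partial a_{i,j} / \partial c = -1$ gives (12b) and (12c) in the same way.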

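Using the step size defined in (2b), this leads to the multiplicative update formulas (13).

$$w_{i,d} \leftarrow w_{i,d} \frac{\sum_{j} r_{i,j} x_{i,j} h_{d,j}}{\sum_{j} r_{i,j} y_{i,j} h_{d,j}} \tag{13a}$$

$$h_{d,j} \leftarrow h_{d,j} \frac{\sum_{i} r_{i,j} x_{i,j} w_{i,d}}{\sum_{i} r_{i,j} y_{i,j} w_{i,d}} \tag{13b}$$

$$c \leftarrow c\, \frac{\sum_{i,j} r_{i,j} y_{i,j}}{\sum_{i,j} r_{i,j} x_{i,j}} \tag{13c}$$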
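The updates (13a)-(13c) can be sketched in NumPy as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the initial values, the iteration count, the constant `eps` and the simple clipping of `c` at `lam` (standing in for an upper bound on the threshold) are all hypothetical choices, and each factor is updated once per iteration.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid (8a)."""
    return 1.0 / (1.0 + np.exp(-a))

def masked_cross_entropy(X, Y, R, eps=1e-12):
    """Marginalized negative log-likelihood (10b) for predicted probabilities Y."""
    return -np.sum(R * (X * np.log(Y + eps) + (1 - X) * np.log(1 - Y + eps)))

def global_thresh_nmf(X, R, K, n_iter=100, lam=10.0, eps=1e-12, seed=0):
    """Logistic NMF with one global threshold c, fitted by multiplicative updates."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, K)) + 0.1           # strictly positive start (assumption)
    H = rng.random((K, N)) + 0.1
    c = 1.0
    RX = R * X
    for _ in range(n_iter):
        RY = R * sigmoid(W @ H - c)
        W *= (RX @ H.T) / (RY @ H.T + eps)      # (13a)
        RY = R * sigmoid(W @ H - c)
        H *= (W.T @ RX) / (W.T @ RY + eps)      # (13b)
        RY = R * sigmoid(W @ H - c)
        c *= RY.sum() / (RX.sum() + eps)        # (13c)
        c = min(c, lam)                         # keep the threshold bounded (assumption)
    return W, H, c
```

The fitted probabilities are obtained as `sigmoid(W @ H - c)`, and `masked_cross_entropy` can be evaluated on held-out entries by passing the test indicator matrix as `R`.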

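Rank 1 Approximation of Threshold. When introducing a rank 1 approximation of the threshold for each $(i, j)$, such that $c_{i,j} = u_{i} v_{j}$, the problem is still convex in $u_{i}$ and $v_{j}$ separately; henceforth this method is known as Max thresh NMF. Let $u_{i}$ be the $i$'th row of $U$ and $v_{j}$ be the $j$'th column of $V$; the partial derivatives are then as shown in (14).

$$\frac{\partial C^{R}(W, H, U, V)}{\partial u_{i}} = \sum_{j} r_{i,j} (x_{i,j} - y_{i,j}) v_{j} \tag{14a}$$

$$\frac{\partial C^{R}(W, H, U, V)}{\partial v_{j}} = \sum_{i} r_{i,j} (x_{i,j} - y_{i,j}) u_{i} \tag{14b}$$

Using the step size defined in (2b), the update formulas are then given in (15).

$$u_{i} \leftarrow u_{i} \frac{\sum_{j} r_{i,j} y_{i,j} v_{j}}{\sum_{j} r_{i,j} x_{i,j} v_{j}} \tag{15a}$$

$$v_{j} \leftarrow v_{j} \frac{\sum_{i} r_{i,j} y_{i,j} u_{i}}{\sum_{i} r_{i,j} x_{i,j} u_{i}} \tag{15b}$$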
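One pass of the rank 1 threshold updates (15a) and (15b) might look as follows, with `W` and `H` held fixed. The function name and `eps` are assumptions. In particular, clipping `u` and `v` at `sqrt(lam)` is only one simple way of keeping every product `u[i] * v[j]` below a bound `lam`; it is an illustrative choice, not the mechanism described in the paper.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid (8a)."""
    return 1.0 / (1.0 + np.exp(-a))

def update_rank1_threshold(X, R, W, H, u, v, lam=10.0, eps=1e-12):
    """One multiplicative update of u and v for the threshold c_ij = u_i * v_j."""
    bound = np.sqrt(lam)                   # u_i, v_j <= sqrt(lam) implies u_i * v_j <= lam
    RX = R * X
    RY = R * sigmoid(W @ H - np.outer(u, v))
    u = np.minimum(u * (RY @ v) / (RX @ v + eps), bound)    # (15a), then clip
    RY = R * sigmoid(W @ H - np.outer(u, v))
    v = np.minimum(v * (u @ RY) / (u @ RX + eps), bound)    # (15b), then clip
    return u, v
```

A row or column of `X` without any observed ones makes the corresponding denominator vanish, so without the clipping step the affected `u[i]` or `v[j]` would grow by many orders of magnitude in a single update.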

2.4.2 Constraints

The update formulas (13c), (15a) and (15b) introduce a risk of dividing by zero or, in the case of a sparse matrix, of the numerator being much larger than the denominator. Together with (7a), this introduces the risk of both $WH$ and $c_{i,j}$ exploding in size. This is avoided by adding the constraint (16) to the optimization problem (11), with $\lambda$ being a positive constant.

$$c_{i,j} \leq \lambda \tag{16}$$

3 SIMULATED DATA

3.1 Generating Simulated Data

The four NMF variants are compared on simulated

data where we know the true underlying components.

The data simulates text data, which is ensured by

putting the following conditions on W and H.

1. Columns of W are sparse

∑

= 1

3. Rows of H are exponentially distributed

1.: Each component represents a topic (Gillis, 2014), but each topic is not necessarily present in all documents.

2.: The pure documents do not contain any noise, and are therefore represented only by the constructed topics.

3.: Terms are exponentially distributed in natural languages (Paukkeri, 2012).

To investigate how the two Logistic NMF methods

behave, compared to LS NMF and KL NMF when

applied to text data, three problems have been created,

with varying degree of sparsity in the score vectors

P. 1 25 % entries in score vectors

P. 2 50 % entries in score vectors

P. 3 75 % entries in score vectors

The data matrix $X$ is then generated as described in (17), with $\tau$ being a positive constant.

$$a = WH - \tau \tag{17a}$$

$$\tilde{X} = \frac{1}{1 + \exp(-a)} \tag{17b}$$

$$X = \mathrm{round}(\tilde{X}) \tag{17c}$$
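The generation step (17) can be sketched as follows. The way `W` and `H` are drawn here is an assumption made for illustration: `density` is read as the fraction of non-zero entries in each score vector, the non-zero scores are uniform, the exponential scale is 1, and `tau = 1.0` is an arbitrary value. The sketch does not attempt to reproduce the second condition on W and H listed above.

```python
import numpy as np

def simulate_binary_data(M=50, N=50, K=4, density=0.25, tau=1.0, seed=0):
    """Generate a binary matrix X from non-negative factors via (17a)-(17c)."""
    rng = np.random.default_rng(seed)
    # Sparse score vectors: roughly a fraction `density` of entries is non-zero.
    W = rng.random((M, K)) * (rng.random((M, K)) < density)
    # Rows of H drawn from an exponential distribution.
    H = rng.exponential(scale=1.0, size=(K, N))
    A = W @ H - tau                        # (17a)
    X_tilde = 1.0 / (1.0 + np.exp(-A))     # (17b)
    X = np.round(X_tilde)                  # (17c)
    return X, W, H
```

Since the sigmoid exceeds 0.5 exactly when its argument is positive, an entry of `X` is 1 wherever `(W @ H)[i, j]` is larger than `tau`, so `tau` controls how dense the simulated data is.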

3.2 Analysing Simulated Data

In all the problems, 4 topics are simulated, with columns of W and rows of H having length 50. The resulting data matrix is therefore of size 50 × 50. The problems are simulated 100 times and modelled by each NMF method. For each simulation, the dataset is divided into 11 parts; the first 10 parts are used to estimate the optimal number of components by 10-fold CV, and the last part is used to report how well the model predicts unknown data.

Figure 1: Box plots of estimated number of components. (a) Problem 1. (b) Problem 2. Length of whiskers is 1.5 IQR. Dashed line indicates the true number of components.

Analysing the Estimated Number of Components. Figure 1 shows box plots of the estimated number of components for each method. A t-test at significance level α = 0.05 leads to the conclusion that none of the methods estimates the correct number of components (4). A Welch t-test at significance level α = 0.05 is used to test whether the methods estimate different numbers of components. Instances where the test is not significant are written in bold face in Table 1.

Table 1: P-values for problems and methods where the methods estimate the same number of components.

                       P. 1     P. 2                   P. 3
LS vs. Glob. thresh    0.19     $1.2 \cdot 10^{-6}$    $< 10^{-10}$
KL vs. Max thresh      0.087    $2.7 \cdot 10^{-8}$    $< 10^{-10}$
Glob. vs. Max          0.22     0.23                   0.62
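For reference, the Welch test statistic and its approximate degrees of freedom can be computed as in the sketch below. The function name and the example inputs are hypothetical, and the numbers are not data from the study. Turning the statistic into a p-value additionally requires the CDF of the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size            # squared standard error of mean(a)
    vb = b.var(ddof=1) / b.size            # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, df

# Made-up numbers of components estimated by two methods (illustration only).
t, df = welch_t([4, 5, 6, 5, 4, 6], [5, 6, 7, 6, 5, 7])
```

Unlike the pooled two-sample t-test, this version does not assume that the two methods have equal variance in the estimated number of components.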

Analysing the Error Level. The Least Squares error (3a), the KL-divergence (5a) and the cross-entropy (9b) are not directly comparable. Hence only the two Logistic NMF methods are comparable in terms of prediction error. Figure 2 shows box plots of the mean cross-entropy of prediction when using the optimal number of components. A Welch t-test at significance level α = 0.05 shows no significant difference in prediction error between the two methods for any of the problems. The p-values are 0.19, 0.12 and 0.08 for Problems 1, 2 and 3, respectively.

Figure 2: Comparison of the mean cross-entropy of prediction for Max NMF and Const NMF with optimal number of components. (a) Problem 1. (b) Problem 2. Length of whiskers is 1.5 IQR.

Figure 3: Comparison of the mean cross-entropy of prediction for Max NMF and Const NMF with 4 component models. (a) Problem 1. (b) Problem 2. Length of whiskers is 1.5 IQR.

Figure 3 shows box plots of the mean cross-entropy of prediction when 4 components are used for both Global thresh NMF and Max thresh NMF. A Welch t-test with significance level α = 0.05 is performed to test whether the error levels are significantly different. The p-value is $p < 10^{-10}$ for all three problems, revealing that Global thresh NMF has lower prediction error than Max thresh NMF in all three problems.

Figure 4: CV error for text data. Panels: (a) LS NMF, (b) KL NMF. (a) and (b): Mean CV error with standard error bars. (c) and (d): Mean CV error minus one standard error. Circles and crosses mark the optimal model complexity with regard to the "one standard-error rule" (Hastie et al., 2009).

4 REAL DATA

Two different real datasets are analysed using the four

NMF variants: A dataset consisting of student com-

ments and a sensory dataset (Randall, 1989).

4.1 Text Data

A collection of 10579 student comments on professors at various American universities, collected at www.ourumd.com, is analysed. After pre-processing (including correcting misspellings, stemming and replacing numbers and names with relevant tags) the collection holds 9400 different words. The resulting Term Document Matrix (TDM) X is therefore of size 10579 × 9400. The estimated models are analysed using the procedure described in (Gillis, 2014). The interpretation of W and H is swapped, since the i'th document is collected in the i'th row of X.

The ability of each method to generalize is estimated using 5-fold CV. In order to avoid local minima, each fold is re-estimated 10 times. Figure 4 shows the estimated generalization error and the chosen model complexity for each method. The dominant words of each topic are shown in Tables 2-5.

Table 2: Most dominant words for LS NMF.

Component 1
exam        lectur      num         grade
question    final       test        class
help        name        book        studi
good        materi      point       homework
hard        problem     note        read

Component 2
num         class       name        professor
cours       interest    great       !
student     teacher     easi        lot
recommend   teach       learn       good
paper       materi      read        make

Table 3: Most dominant words for KL NMF.

Component 1
num         class       name        professor
cours       lectur      exam        grade
good        easi        lot         student
help        materi      test        read
teacher     teach       !           question

Table 4: Most dominant words for Global thresh NMF.

Component 1
num         lectur      exam        class
question    studi       easi        onlin
slide       answer      post        attent
test        note        choic       pai
multipl     hard        name        attend

Component 2
num         class       name        cours
professor   exam        good        materi
help        lectur      cover       start
student     prepar      level       come
found       teach       will        school

Component 3
num         class       name        particip
essai       group       present     grade
easi        fun         paper       interest
core        read        project     !
short       page        midterm     requir

Table 5: Most dominant words for Max thresh NMF.

Component 1
offic       ve          look        complet
tell        averag      feel        major
hour        practic     high        sai
ask         school      minut       num
show        mean        gener       give

Component 2
multipl     post        answer      question
choic       final       slide       onlin
concept     attend      class       num
midterm     grade       point       requir
exam        semest      studi       consist

4.2 Wine Data

The four variants of NMF are applied to the 'Bitterness of Wine' dataset (Randall, 1989). The data is represented by introducing the 7 binary variables

• Contact (1 if "yes", 0 if "no")

• Temperature (1 if "warm", 0 if "cold")

• Rating 1

• Rating 2

• Rating 3

• Rating 4

• Rating 5
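The encoding listed above can be sketched as a small helper. The function name, the argument types and the column order (Contact, Temperature, Rating 1 to Rating 5) are assumptions made for illustration.

```python
import numpy as np

def encode_wine_sample(contact, temperature, rating):
    """Encode one sample as 7 binary variables:
    [Contact, Temperature, Rating 1, ..., Rating 5]."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError("rating must be an integer from 1 to 5")
    row = np.zeros(7)
    row[0] = 1.0 if contact == "yes" else 0.0
    row[1] = 1.0 if temperature == "warm" else 0.0
    row[1 + rating] = 1.0                  # ratings 1..5 map to columns 2..6
    return row

example = encode_wine_sample("yes", "cold", 3)
```

Stacking one such row per sample gives a binary data matrix in which exactly one of the five rating columns is 1 in each row.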

The generalization error of each of the models is estimated using 10-fold cross-validation. In order to avoid local minima in the training process, each fold is trained 30 times and the model with the lowest validation error is reported. The model complexity is chosen using the "one standard-error rule" (Hastie et al., 2009). The estimated generalization error and chosen model complexities are shown in Figure 5.

Figure 6 shows the components of the estimated 6 component model using Global thresh NMF. From the rows of H it is seen that the components basically describe one variable each, except for Contact, which is described by both components 5 and 6; further, component 4 describes both Temperature and Rating 5.

5 DISCUSSION

5.1 Simulated Data

On the simulated problems we saw that the assumptions regarding the distribution of the noise, and thereby the choice of method, influence how many components are estimated as being optimal. We observed that all four methods have a tendency to estimate too many components as being optimal, even though the "one standard-error rule" was used to choose the number of components in the models. When changing the number of components, the interpretation of a model may change, as the distribution of signal among the components is changed. Thus, the use of extra components compared to the true number of components (4) may hinder the interpretation of the estimated model. A low generalization error is as important as estimating the true number of components when building a model. Furthermore, it is observed that the size of the error in estimating the number of components is problem dependent.

Figure 5: CV error for wine data. Panels: (a) LS NMF, (b) KL NMF. (a) and (b): Mean CV error with standard error bars. (c) and (d): Mean CV error minus one standard error. Circles and crosses mark the optimal model complexity with regard to the "one standard-error rule" (Hastie et al., 2009).

The error levels were compared for Max thresh NMF and Global thresh NMF. It was seen that when the optimal number of components was used, the two methods performed with the same error level. When the true number of components was used, Global thresh NMF performed significantly better than Max thresh NMF.

5.2 Text Data

The components or topics extracted from the collection of student comments are clearly related to teaching and exams. The logistic NMFs are able to extract at least as many components from the data as LS NMF and more components than KL NMF. This indicates that KL NMF is not as well suited for binary data as the other methods.


Figure 6: Visualization of the components of the estimated 6 component model using Global thresh NMF. (a) Columns of W plotted against each other. (b) Bar plot of rows of H.

5.3 Wine Data

When estimating the optimal number of components, it was observed that the estimate depended on the method. LS NMF, KL NMF and Max thresh NMF estimated that 1-2 components should be used, while Global thresh NMF estimated that 6 components were optimal. Further, LS NMF and KL NMF were seen to have large standard errors.

The 6 component model determined with Global thresh NMF was presented. It was seen that most variables were to some degree described by a component, whereas the variable Rating 1 was not described by any of the components. The variable Rating 1 is 1 in 5 samples and 0 in 67 samples, i.e. it is 1 in approx. 7% of the samples, which may have influenced the model.

6 CONCLUSION

We presented the two well-known variants of Non-negative Matrix Factorization and proposed the logistic Non-negative Matrix Factorization for binary data. The proposed method was presented in two variants: one with a global threshold for the logistic sigmoid function and one with a rank one approximation (max threshold).

The four NMF methods were applied to a collection of student comments regarding professors at various universities. It was seen that all methods were able to extract components describing teaching and exams. Global thresh NMF and Max thresh NMF were able to validate more components than the two usual methods, LS NMF and KL NMF.

The "Bitterness of Wine" dataset (Randall, 1989) was analysed using all four NMF methods. It was seen that methods which had a large standard error when doing Cross Validation either estimated few components (LS, KL and Max) or many components (Global).

The four NMF methods were also compared on simulated data with four underlying components. The methods had good error convergence. However, the methods had a tendency to choose too many components. Furthermore, it was seen that the bias in estimating the number of components was problem dependent.

The interpretation may change with a varying number of components, and as this is where we observed the most issues for all four studies, we recommend that future work should investigate methods to estimate the number of components.

ACKNOWLEDGEMENTS

The authors would like to thank Jacob Kogan from the Department of Mathematics and Statistics at the University of Maryland for collecting the student comments from www.ourumd.com.

REFERENCES

Boyd, S. and Vandenberghe, L. (2009). Convex Optimization. Cambridge University Press.

Gillis, N. (2014). The why and how of nonnegative matrix factorization. ArXiv e-prints.

Gillis, N. and Glineur, F. (2012). Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Computation, 24(4):1085–1105.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning - Data Mining, Inference, and Prediction. Springer, 2nd edition.

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13(13):556–562.

Nielsen, S. F. V. and Mørup, M. (2014). Non-negative tensor factorization with missing data for the modeling of gene expressions in the human brain. In 2014 IEEE International Workshop on Machine Learning for Signal Processing.

Paukkeri, M.-S. (2012). Language- and domain-independent text mining. Doctoral Dissertations. Aalto University.

Randall, J. (1989). The analysis of sensory data by generalised linear model. Biometrical Journal, 7:781–793.

Tomé, A. M., Schachtner, R., Vigneron, V., Puntonet, C. G., and Lang, E. W. (2015). A logistic non-negative matrix factorisation approach to binary data sets. Multidim Syst Sign Process, 26:125–143.

Zhang, Z., Li, T., Ding, C., and Zhang, X. (2010). Binary matrix factorization with applications. Data Mining and Knowledge Discovery, 20(1):28–52.
