is estimated by $\hat{z}_i = k$. This same row can then be represented at the bidimensional coordinates $\hat{s}_i^{MAP} = s_{\hat{z}_i} = (s_{\hat{z}_i,1}, s_{\hat{z}_i,2})^T$. By performing this procedure for each row $i$, the model builds a reduced view of the $n$ categories of $I$. Moreover, when two nodes have coordinates $s_k$ and $s_{k'}$ that are near in the latent space, their corresponding clusters should have similar parameters $\alpha_{k\ell}$ and $\alpha_{k'\ell}$, so their corresponding contents should also be similar. A fuzzy projection can be obtained by computing an average position of each row $i$ from its posterior probabilities $\hat{c}_{ik}$. This is written $\hat{s}_i = \sum_{k=1}^{g} \hat{c}_{ik} (s_{k1}, s_{k2})^T$. If the vector of probabilities $(\hat{c}_{i1}, \hat{c}_{i2}, \cdots, \hat{c}_{ig})$ is binary, then row $i$ is in cluster $\hat{z}_i$ and $\hat{s}_i = \hat{s}_i^{MAP}$. This is generally the case with GTM for a large part of the dataset. In the experimental part, only a tabular view after binarizing these vectors of probabilities is constructed, except for a small illustrative example.
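The two projections above can be sketched numerically. This is a minimal illustration with made-up data, not the paper's implementation: `S` stands for the fixed node coordinates $s_k$ of the map and `C` for the posterior probabilities $\hat{c}_{ik}$.

```python
import numpy as np

rng = np.random.default_rng(0)
g, n = 9, 5
S = rng.uniform(-1.0, 1.0, size=(g, 2))      # latent node coordinates s_k
C = rng.dirichlet(np.ones(g), size=n)        # posteriors c_ik, each row sums to 1

# MAP projection: each row i is sent to the node of its most probable cluster.
z_hat = C.argmax(axis=1)                     # hard assignment z_i
s_map = S[z_hat]                             # s_i^MAP = s_{z_i}

# Fuzzy projection: posterior-weighted average of the node positions.
s_fuzzy = C @ S                              # s_i = sum_k c_ik (s_k1, s_k2)^T

# When C is binarized (one-hot), the two projections coincide.
C_bin = np.eye(g)[z_hat]
assert np.allclose(C_bin @ S, s_map)
```

The last assertion checks the property stated in the text: binarizing the posterior vectors makes the fuzzy projection collapse onto the MAP projection.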
3.2 Datasets
The characteristics of the four real datasets are de-
scribed below.
- N4. This dataset is composed of 400 documents selected from a textual corpus of 20000 usenet posts from 20 original newsgroups. From each of the 4 retained groups, 100 posts are selected, and 100 terms are filtered by mutual information (Kabán and Girolami, 2001).
- Binary1. This dataset in (Slonim et al., 2000) consists of 500 posts separated into two clusters for the newsgroups talk.politics.mideast and talk.politics.misc. A preprocessing was carried out by the authors to reduce the number of words by ignoring all file headers, stop words and numeric characters. Moreover, the top 2000 words were selected using mutual information.
- Multi51. This dataset in (Slonim et al., 2000) consists of 500 posts separated into five clusters: comp.graphics, sci.space, rec.motorcycles, rec.sport.baseball and talk.politics.mideast. The same preprocessing as for Binary1 was performed.
- C3. This dataset in (Dhillon et al., 2003), also known as Classic3, is often used as a benchmark for co-clustering. It is a contingency table of size 3891 × 4303 and is composed of three classes denoted Medline, Cisi and Cranfield, as in the larger complete data sample not considered here.
The four datasets studied in our experiments are of increasing size.
3.3 Results
Table 1 summarizes the characteristics of the datasets and the parameters for BlockGTM. The four constructed maps are squares of size g = 9 × 9 for the clustering in rows, while the number of clusters m in columns and the dimension h were chosen after a few tries. Each map is represented as follows: for each k-th cluster, a barplot corresponding to the true labels of the data in the cluster is constructed at position $s_k$ after fitting the model. The results are shown in Figure 2 for Multi51 and C3. For a given dataset, the map thus shows a matrix of 9 × 9 barplots, such that if two nodes are close in the latent space they should have similar barplots. This is a tabular view of the data (categories I), which also confirms that the nearest clusters contain texts with similar topics, as expected.
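The per-node barplots described above amount to a label histogram per cluster. A short sketch, with illustrative variable names and synthetic assignments (not the paper's data), of how such counts could be tallied:

```python
import numpy as np

g, n_labels = 81, 5                          # a 9x9 map and 5 true classes
rng = np.random.default_rng(1)
z_hat = rng.integers(0, g, size=500)         # hard cluster of each document
labels = rng.integers(0, n_labels, size=500) # true class of each document

# counts[k, c] = number of documents of true class c assigned to cluster k;
# row k holds the bar heights drawn at node position s_k on the map.
counts = np.zeros((g, n_labels), dtype=int)
np.add.at(counts, (z_hat, labels), 1)        # unbuffered in-place accumulation
assert counts.sum() == 500
```

Plotting each row of `counts` at its node position $s_k$ then reproduces the kind of 9 × 9 matrix of barplots shown in Figure 2.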
Table 1: Summary where n × d is the size of the contingency table, m is the number of clusters in columns, h is the number of basis functions, and E_r1 (resp. E_r2) is the accuracy in percent from BlockGTM (resp. PLBM).

Data     |    n |    d |  m |  h | E_r1 | E_r2
N4       |  400 |  100 | 10 | 12 | 96.5 | 93.4
Binary1  |  400 |  100 | 10 | 19 | 91.2 | 92.4
Multi51  |  500 | 2000 | 20 | 19 | 90.6 | 89.0
C3       | 3891 | 4303 | 20 | 28 | 99.1 | 99.3
In this section we are interested in measuring how well the co-clustering can reveal the inherent structure of a given textual dataset. We consider the accuracy, which is usually derived from the confusion matrix or the cluster purity. Specifically, we measure the quality of the clustering for the obtained clusters compared to the real categories of the documents. The columns E_r1 and E_r2 of Table 1 give, in percent, the accuracy obtained respectively for BlockGTM and for PLBM initialized with the final parameters of BlockGTM.
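Since there are more clusters than true classes, one common convention for such an accuracy is cluster purity: each cluster is mapped to its majority true class before counting agreements. A minimal sketch of this measure (the paper does not spell out its exact computation, so this is one plausible reading):

```python
import numpy as np

def clustering_accuracy(y_true, y_pred):
    """Purity-style accuracy: map each cluster to its majority true class,
    then return the fraction of documents that agree with this mapping."""
    total = 0
    for k in np.unique(y_pred):
        members = y_true[y_pred == k]        # true classes inside cluster k
        total += np.bincount(members).max()  # size of the majority class
    return total / len(y_true)

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([2, 2, 1, 0, 0, 1])
acc = clustering_accuracy(y_true, y_pred)    # 5 of 6 agree -> 5/6
```

With many small clusters per class, as in the 9 × 9 maps here, this measure rewards clusters that each stay within a single true category.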
- For N4, the categories of I are projected by the Correspondence Analysis (CA) method (Benzécri, 1980). The coordinates from CA are used to compute the positions of the mean centers in a 3-dimensional space thanks to the quantities $\hat{c}_{ik}$. Figure 3 shows the result. It is interesting to note that the original mesh composed of the nodes S in the latent space is easily recognized in this 3-dimensional space. Here each class is quantized by a subset of clusters from the map, and each subset usually includes only data whose projections are close in the projection space, as expected.
- For C3, the proposed method extracts the origi-
ICPRAM 2012 - International Conference on Pattern Recognition Applications and Methods