Table 1: Document classification sets.

       news fields       text proc. params   clustering params   size (no. of clusters)
CS1    headline          lf = 0              k ∈ [5..20]          4 (50)
       body              lf = {0, 1, 5}      k ∈ [5..20]         12 (150)
CS2    headline + body   lf = 5              k ∈ [5..43]         20 (480)
CS3    headline          lf = 0              k ∈ [5..20]          4 (50)
       body              lf = 0              k ∈ [5..20]          4 (50)
       metadata          –                   –                    3 (19)
The RCV1 documents are provided with alternative categorizations according to three different category fields (metadata): TOPICS (i.e., major subjects), INDUSTRIES (i.e., types of businesses discussed), and REGIONS (i.e., geographic locations and economic/political side information). After filtering out very short news articles (i.e., documents smaller than 6KB) and any article that did not have at least one value for each of the three category fields, we selected the articles labeled with one of the Top-5 categories of each category field. This resulted in a dataset of 3081 news articles. From the text of the articles, we discarded strings of digits, retained alphanumerical terms, removed stop-words, and performed word stemming.
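As an illustration, the following is a minimal sketch of this preprocessing step; NLTK's English stop-word list and Porter stemmer are assumptions, since the paper does not name the tools it used.

    # Minimal preprocessing sketch (assumed tools: NLTK stop-words and
    # Porter stemmer; requires nltk.download('stopwords') once).
    import re
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    STOP = set(stopwords.words('english'))
    stemmer = PorterStemmer()

    def preprocess(text):
        # Keep alphanumerical tokens, drop pure strings of digits.
        tokens = re.findall(r'[a-z0-9]+', text.lower())
        tokens = [t for t in tokens if not t.isdigit()]
        # Remove stop-words, then stem.
        return [stemmer.stem(t) for t in tokens if t not in STOP]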
We generated various sets of classifications over the RCV1 dataset, according to the textual content as well as to the Topics/Industries/Regions metadata. To generate the text-based classifications, we used the bisecting k-means algorithm implemented in the well-known CLUTO toolkit (Karypis, 2007) to produce clustering solutions of the documents represented over the space of the terms contained in the body and/or headlines. Table 1 summarizes the main characteristics of the three sets of document classifications used in our evaluation. Columns text proc. params and clustering params report, respectively, the lower document-frequency cut threshold (lf, in percent) used to select the terms for the document representation, and the number of clusters (k, with increments of 5 in CS1 and CS3, and of 2 in CS2) given as input to CLUTO to generate the text-based classifications. Moreover, column size reports the number of classifications and the related total number of document clusters (within brackets) that rely on the same type of information (i.e., body, headline, metadata).
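To make one such clustering run concrete, the sketch below uses scikit-learn's BisectingKMeans as a stand-in for CLUTO's bisecting k-means (an assumption: the toolkit's criterion functions and splitting heuristics differ), with min_df playing the role of the lf threshold.

    # Sketch of a single text-based clustering run (scikit-learn as a
    # stand-in for CLUTO; results will not be identical).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import BisectingKMeans

    def cluster_texts(docs, lf_percent, k):
        # min_df implements the lower document-frequency cut threshold lf.
        vec = TfidfVectorizer(min_df=lf_percent / 100.0)
        X = vec.fit_transform(docs)
        return BisectingKMeans(n_clusters=k, random_state=0).fit_predict(X)

    # E.g., the CS2 classifications: lf = 5, k = 5, 7, ..., 43.
    # solutions = [cluster_texts(docs, 5, k) for k in range(5, 44, 2)]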
For each of the three document classification sets, we derived different tensors according to various settings of the closed frequent document-set extraction. Table 2 details the tensors built upon the selected configurations. Note that, in each of the tensors, mode-2 corresponded to the space of terms extracted from the body and headline of the news articles (2692 terms), and mode-3 to the average number of clusters in the corresponding classification set (i.e., 13 for CS1, 24 for CS2, and 11 for CS3).
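The paper does not detail the mining procedure itself; as a rough sketch under the assumption that each term acts as a transaction over the documents containing it (so that frequent itemsets are document-sets), closed frequent document-sets of a minimum length could be extracted as follows, using mlxtend's apriori.

    # Hedged sketch of closed frequent document-set (CDS) extraction.
    # Assumption: term_doc_df is a boolean DataFrame with one row per
    # term (transaction) and one column per document id (item).
    from mlxtend.frequent_patterns import apriori

    def closed_document_sets(term_doc_df, min_support, min_length):
        freq = apriori(term_doc_df, min_support=min_support,
                       use_colnames=True)
        freq = freq[freq['itemsets'].apply(len) >= min_length]
        # A frequent itemset is closed if no proper superset of it
        # has the same support.
        return [(s, sup)
                for s, sup in zip(freq['itemsets'], freq['support'])
                if not any(s < t and abs(sup - u) < 1e-12
                           for t, u in zip(freq['itemsets'], freq['support']))]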
Table 2: Tensors and their decompositions.

            min. length   no. of   avg. % of      TD-S size
            of CDS        CDS      CDS per doc.
CS1 Ten1     50           17443    3.29%          174 × 27 × 13
CS1 Ten2    100            5871    5.25%           58 × 27 × 13
CS1 Ten3    150            2454    7.12%           24 × 27 × 13
CS1 Ten4    200            1265    8.53%           12 × 27 × 13
CS2 Ten1     50           12964    3.78%          129 × 27 × 24
CS2 Ten2    100            7137    4.87%           71 × 27 × 24
CS2 Ten3    150            3129    5.89%           31 × 27 × 24
CS2 Ten4    180             918    7.53%            9 × 27 × 24
CS3 Ten1     50            2806    3.09%           28 × 27 × 11
CS3 Ten2    100             843    5.15%            8 × 27 × 11
CS3 Ten3    150             326    7.15%            3 × 27 × 11
Table 3: Summary of results.

TD-S       clustering   F      Q        TD-L       clustering   F      Q
CS1 Ten1   monoth.      0.509  0.603    CS1 Ten1   direct       0.556  0.601
           tf-proj.     0.610  0.838               tf-proj.     0.665  0.881
CS1 Ten2   monoth.      0.534  0.599    CS1 Ten2   direct       0.570  0.603
           tf-proj.     0.625  0.838               tf-proj.     0.688  0.884
CS1 Ten3   monoth.      0.542  0.598    CS1 Ten3   direct       0.586  0.601
           tf-proj.     0.624  0.835               tf-proj.     0.689  0.889
CS1 Ten4   monoth.      0.533  0.598    CS1 Ten4   direct       0.579  0.605
           tf-proj.     0.624  0.838               tf-proj.     0.687  0.837
CS2 Ten1   monoth.      0.494  0.603    CS2 Ten1   direct       0.599  0.604
           tf-proj.     0.569  0.847               tf-proj.     0.625  0.893
CS2 Ten2   monoth.      0.496  0.603    CS2 Ten2   direct       0.556  0.601
           tf-proj.     0.561  0.843               tf-proj.     0.629  0.889
CS2 Ten3   monoth.      0.495  0.603    CS2 Ten3   direct       0.560  0.604
           tf-proj.     0.570  0.846               tf-proj.     0.635  0.895
CS2 Ten4   monoth.      0.497  0.604    CS2 Ten4   direct       0.555  0.602
           tf-proj.     0.577  0.848               tf-proj.     0.639  0.890
CS3 Ten1   monoth.      0.556  0.597    CS3 Ten1   direct       0.619  0.600
           tf-proj.     0.617  0.837               tf-proj.     0.677  0.888
CS3 Ten2   monoth.      0.556  0.597    CS3 Ten2   direct       0.619  0.599
           tf-proj.     0.620  0.837               tf-proj.     0.686  0.839
CS3 Ten3   monoth.      0.553  0.597    CS3 Ten3   direct       0.610  0.596
           tf-proj.     0.620  0.837               tf-proj.     0.680  0.887
For each of the constructed tensors, we ran the algorithm in Figure 2 with different settings to obtain two decompositions. The first one led to a core tensor whose number of mode-3 components equals the average number of clusters in the original classification set, whereas the numbers of components of the other two modes were set to the number of closed document-sets and the number of terms, respectively, each scaled by a factor of 0.01. The second decomposition was devised to obtain a larger core tensor, in which the number of components of each mode is 10 times that of the corresponding mode in the core tensor obtained by the first decomposition. We use the suffixes TD-S and TD-L to denote the first (smaller) and the second (larger) decomposition of a tensor, respectively; the last column of Table 2 details the TD-S decompositions. From the result of a TD-S (resp. TD-L) decomposition, we derived a monothetic (resp. direct) or, alternatively, a tf-projected clustering solution, with a number of clusters equal to the number of mode-3 components.
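As a rough illustration of the TD-S setting, the sketch below uses TensorLy's Tucker decomposition as a stand-in for the algorithm in Figure 2 (an assumption: the actual decomposition algorithm may differ).

    # Sketch of the TD-S decomposition (TensorLy as a stand-in for the
    # algorithm in Figure 2).
    import tensorly as tl
    from tensorly.decomposition import tucker

    def td_s(tensor, avg_clusters):
        n_cds, n_terms, _ = tensor.shape
        # Modes 1 and 2 scaled by 0.01; mode 3 set to the average number
        # of clusters of the original classification set. The paper's
        # exact rounding convention is not stated.
        ranks = [max(1, round(0.01 * n_cds)),
                 max(1, round(0.01 * n_terms)),
                 avg_clusters]
        core, factors = tucker(tl.tensor(tensor), rank=ranks)
        return core, factors

    # E.g., the CS1 Ten1 tensor (17443 CDS x 2692 terms x 13 clusters)
    # yields a 174 x 27 x 13 core, matching the TD-S size in Table 2.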
All clustering solutions were evaluated in terms of the average F-measure (Steinbach et al., 2000) (F) between a clustering solution derived from the tensor model and each of the input document classifications.
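For reference, a sketch of this measure in its standard formulation (per-class best F over all clusters, weighted by class size; we assume this matches the variant used here) is:

    # Average F-measure between a reference classification and a
    # clustering solution (Steinbach et al., 2000).
    import numpy as np

    def avg_f_measure(classes, clusters):
        classes, clusters = np.asarray(classes), np.asarray(clusters)
        n, total = len(classes), 0.0
        for c in np.unique(classes):
            in_c = classes == c
            best = 0.0
            for k in np.unique(clusters):
                in_k = clusters == k
                tp = np.sum(in_c & in_k)
                if tp == 0:
                    continue
                prec, rec = tp / in_k.sum(), tp / in_c.sum()
                best = max(best, 2 * prec * rec / (prec + rec))
            total += in_c.sum() / n * best
        return total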
By using the original tf.idf representation of the documents (based on the text of body plus headline)