Through the analysis of the clusters, we found the following points about the threshold value of the similarity.
• If the threshold value is high, precise but small clusters are generated.
• As the threshold value becomes lower, clusters come to include improper threads whose contents differ from the contents of the clusters.
The threads in a cluster include “characteristic words” that represent the content of the cluster. However, non-characteristic words are also used in the calculation of the similarity. As a result, the similarity between a cluster and a thread that is improper for the content of the cluster may exceed the threshold value, so that the cluster comes to contain the improper thread. Therefore, we propose a clustering method that reflects the characteristics of words in the similarity. The proposed method uses a “category dictionary” that holds values indicating how characteristic each word is in each cluster. In the dictionary, characteristic words have high values and non-characteristic words have low values. These values are applied as weights to the similarity so that it reflects the characteristics. In order to generate precise clusters, the dictionary needs to contain enough words and appropriate weight values for them. However, constructing the dictionary is a time-consuming task for operators. It is therefore necessary to generate clusters and update the dictionary automatically and accurately.
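The effect of weighting can be illustrated with a small sketch. The text does not specify how the dictionary values enter the similarity, so the scheme below (multiplying each word frequency by its dictionary value before taking the cosine, with unknown words given a weight of 0) is only one possible reading. The function names, the example words, and the weight values are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse word -> value dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def apply_weights(freqs, weights, default=0.0):
    """Scale each word frequency by its dictionary value (assumed scheme).

    Words missing from the dictionary get `default` (an assumption).
    """
    return {w: f * weights.get(w, default) for w, f in freqs.items()}


# Hypothetical word frequencies of a cluster and of an improper thread
# that shares only non-characteristic words with it.
cluster = {"password": 3, "reset": 2, "please": 2, "thanks": 2}
improper = {"invoice": 3, "please": 2, "thanks": 2}

# Hypothetical dictionary values for this cluster: characteristic words
# are high, non-characteristic words are low.
weights = {"password": 1.0, "reset": 0.9, "please": 0.05, "thanks": 0.05}

plain = cosine(cluster, improper)
weighted = cosine(apply_weights(cluster, weights),
                  apply_weights(improper, weights))
print(f"plain={plain:.2f} weighted={weighted:.2f}")
```

With these toy values the unweighted similarity is about 0.42 and the weighted one about 0.04, so the improper thread would pass a threshold of 0.4 only in the unweighted case.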
Figure 1 shows the flow of extracting FAQ candidates by the clustering method, which consists of the following three steps:
(1) Making Core Clusters with a Strict (High) Threshold Value: In order to ensure accuracy at the beginning of the clustering, small but precise clusters (core clusters) are generated by hierarchical clustering with a high threshold value. The values in the dictionary are set to tf-idf (term frequency-inverse document frequency) values, a typical indicator of how characteristic a word is (Salton and McGill, 1983).
(2) Expanding Clusters with an Appropriately Loosened (Low) Threshold Value: A small cluster is not regarded as a candidate FAQ, because its content is not considered to be a frequent inquiry. Therefore, core clusters are expanded with a low threshold value by referring to the category dictionary.
(3) Cleansing Clusters: Improper threads in a cluster are removed from the cluster.
Figure 1: Overview of clustering with dictionary.
These three steps need threshold values, which are impractical to set appropriately by hand. Therefore, we also propose a mechanism that sets these threshold values automatically.
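To see why the threshold matters, the following minimal sketch runs a threshold-stopped hierarchical clustering on a toy similarity matrix. It does not implement the automatic threshold setting. Average linkage is an assumption, since the text only says "hierarchical clustering", and the function name `agglomerate` and the matrix values are hypothetical.

```python
def agglomerate(sim, threshold):
    """Merge clusters greedily while the best linkage is >= threshold.

    sim is a symmetric matrix (list of lists) of pairwise thread
    similarities. Average linkage is assumed here.
    """
    clusters = [[i] for i in range(len(sim))]

    def link(a, b):
        # Average similarity over all cross-cluster thread pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = link(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy similarities: threads 0 and 1 are near-duplicates, as are 3 and 4;
# thread 2 is only loosely related to threads 0 and 1.
sim = [
    [1.0, 0.9, 0.6, 0.1, 0.1],
    [0.9, 1.0, 0.6, 0.1, 0.1],
    [0.6, 0.6, 1.0, 0.2, 0.2],
    [0.1, 0.1, 0.2, 1.0, 0.85],
    [0.1, 0.1, 0.2, 0.85, 1.0],
]

strict = agglomerate(sim, 0.8)  # [[0, 1], [2], [3, 4]]
loose = agglomerate(sim, 0.5)   # [[0, 1, 2], [3, 4]]
print(strict, loose)
```

With the high threshold the clusters stay small and tight, while the low threshold pulls the loosely related thread 2 into the first cluster.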
2.2 Construction of Core Clusters
Core clusters should be constructed precisely, because the dictionary made from them must hold appropriate information on the characteristics of words in order to generate correct clusters in the later steps. Therefore, each core cluster has to consist of threads that are strictly similar to each other. The similarity index used in the clustering is calculated as the weighted sum of the cosine similarity between the inquiries of the threads and the cosine similarity between the replies of the threads.
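$$\mathrm{Sim}(Th_i, Th_j) = (1-\alpha)\,\mathrm{cosSimQ}_{i,j} + \alpha\,\mathrm{cosSimA}_{i,j} \tag{1}$$

$$\mathrm{cosSimQ}_{i,j} = \frac{\vec{Q}_i \cdot \vec{Q}_j}{\|\vec{Q}_i\|\,\|\vec{Q}_j\|}, \qquad \mathrm{cosSimA}_{i,j} = \frac{\vec{A}_i \cdot \vec{A}_j}{\|\vec{A}_i\|\,\|\vec{A}_j\|}$$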
Here, $Th_i$ is a thread consisting of $\vec{Q}_i$ and $\vec{A}_i$, where $\vec{Q}_i$ is the vector of word frequencies in the inquiry of $Th_i$ and $\vec{A}_i$ is the vector of word frequencies in the reply of $Th_i$. The similarity index is given by $\mathrm{Sim}()$, $\mathrm{cosSimQ}_{i,j}$ is the similarity between the inquiries $\vec{Q}_i$ and $\vec{Q}_j$, $\mathrm{cosSimA}_{i,j}$ is the similarity between the replies $\vec{A}_i$ and $\vec{A}_j$, and $\alpha$ ($0 < \alpha < 1$) is a constant that determines how much each of the two similarities contributes to the clustering. Replies are usually written by specific operators, so the words used in replies with the same content are similar. Therefore, $\alpha$ might be set larger than 0.5.
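Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the whitespace tokenization, the helper names `make_thread`, `cos_sim` and `thread_sim`, the example texts, and the default value of 0.6 for alpha are assumptions made only for the example.

```python
import math
from collections import Counter


def make_thread(inquiry, reply):
    """Build (Q, A) word-frequency vectors; whitespace tokenization is assumed."""
    return Counter(inquiry.lower().split()), Counter(reply.lower().split())


def cos_sim(u, v):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def thread_sim(th_i, th_j, alpha=0.6):
    """Eq. (1): (1 - alpha) * cosSimQ + alpha * cosSimA, with 0 < alpha < 1."""
    q_i, a_i = th_i
    q_j, a_j = th_j
    return (1 - alpha) * cos_sim(q_i, q_j) + alpha * cos_sim(a_i, a_j)


# Hypothetical threads: different wording in the inquiries, same reply.
th1 = make_thread("cannot login to portal", "please reset your password")
th2 = make_thread("login fails on portal", "please reset your password")

print(thread_sim(th1, th2, alpha=0.6))
```

In this example the inquiries share two of four words (cosSimQ = 0.5) and the replies are identical (cosSimA = 1.0), so with alpha = 0.6 the similarity is 0.4 × 0.5 + 0.6 × 1.0 = 0.8. A larger alpha lets the consistent operator-written replies dominate.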
After the construction of core clusters, a category dictionary is generated from the core clusters. This category dictionary is referred to in the expansion and refinement of clusters. The category dictionary keeps the tf-idf value of each word in each cluster as a typical indicator of the characteristics of each cluster. The tf-idf value of $Word_s$ is high if the word appears frequently in the thread $Th_i$ and the number of clusters containing the word is small.
$$\text{tf-idf}(Th_i, Word_s) = tf_{i,s} \times idf_s$$

$$tf_{i,s} = \frac{\text{Freq. of } Word_s \text{ in } Th_i}{\text{Num. of all words in } Th_i}$$

$$idf_s = \log \frac{\text{Num. of all clusters}}{\text{Num. of clusters including } Word_s}$$
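A category dictionary of this kind could be built roughly as follows. The sketch makes one assumption that the text leaves open: since the dictionary is kept per cluster while tf is written for a thread, the threads of each core cluster are pooled and tf is computed over the pooled words. The function name `build_category_dictionary` and the example clusters are hypothetical.

```python
import math
from collections import Counter


def build_category_dictionary(clusters):
    """Return cluster_id -> {word: tf-idf value}.

    clusters maps a cluster id to a list of threads, each thread being a
    list of word tokens. Threads of a cluster are pooled (an assumption).
    """
    pooled = {
        cid: Counter(w for thread in threads for w in thread)
        for cid, threads in clusters.items()
    }
    n_clusters = len(pooled)

    # Number of clusters in which each word occurs at least once.
    cluster_freq = Counter()
    for counts in pooled.values():
        cluster_freq.update(counts.keys())

    dictionary = {}
    for cid, counts in pooled.items():
        total = sum(counts.values())
        dictionary[cid] = {
            w: (c / total) * math.log(n_clusters / cluster_freq[w])
            for w, c in counts.items()
        }
    return dictionary


# Hypothetical core clusters.
core_clusters = {
    "login": [["password", "reset", "please"],
              ["password", "login", "please"]],
    "billing": [["invoice", "payment", "please"],
                ["invoice", "refund", "please"]],
}

category_dict = build_category_dictionary(core_clusters)
print(category_dict["login"])
```

Here "please" occurs in both clusters, so its idf is log(2/2) = 0 and it contributes nothing, whereas "password" occurs twice among the six pooled words of one cluster only and receives the highest value in that cluster.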
ICEIS 2012 - 14th International Conference on Enterprise Information Systems