OLAP analysis. Recently, document warehousing (a set of approaches for analyzing, sharing, and reusing unstructured data, such as textual data or documents) has become an important research field. Many issues remain open, but we are mainly interested in taking the textual content of data into account in OLAP analysis. In this context, the measure can be textual (e.g., a list of keywords), so aggregation functions adapted to textual measures are needed.
In this paper, the main contribution is to provide an OLAP aggregation function for textual measures. This function allows an analysis based on keyword measures for multidimensional document analysis. From the literature on keyword aggregation, we cluster the existing methods into four groups: the first is based on linguistic knowledge, the second on external knowledge, the third on graphs, and the last on statistical methods. Our approach falls into the last category. The existing approaches using statistical methods focus mainly on the frequencies of keywords. In contrast, the approach that we propose uses a well-known data mining technique, the k-means algorithm, with a distance based on the Google similarity distance.
The Google similarity distance was proposed in (Cilibrasi and Vitanyi, 2007) and has been tested on more than eight billion web pages. The choice of this distance is motivated by the fact that it takes into account the semantic similarity of keywords. We name our approach GOTA (GOogle similarity distance in OLAP Textual Aggregation).
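To make this concrete, the Google similarity distance (Normalized Google Distance) between two keywords x and y is computed from page hit counts, following the formula in (Cilibrasi and Vitanyi, 2007). The following Python sketch uses made-up hit counts; in practice f(x), f(y), and f(x, y) come from search engine queries and n is the number of indexed pages:

import math

def ngd(fx, fy, fxy, n):
    # fx, fy: number of pages containing each keyword alone
    # fxy: number of pages containing both keywords
    # n: total number of pages indexed by the search engine
    if fxy == 0:
        return float('inf')  # the keywords never co-occur
    lfx, lfy, lfxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lfx, lfy) - lfxy) / (math.log(n) - min(lfx, lfy))

# Made-up hit counts: semantically related keywords yield a small distance.
print(ngd(fx=9000000, fy=8000000, fxy=6000000, n=8000000000))

A matrix of such pairwise distances can then drive the k-means clustering of keywords mentioned above.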
The performance of our approach is analyzed and compared to that of a method using the k-bisecting clustering algorithm with the Jensen-Shannon divergence between probability distributions (Wartena and Brussee, 2008). The rest of the paper is organized as follows:
Section 2 reviews related work on textual aggregation. In Section 3, we introduce our proposed approach. In Section 4, we present the experimental study, which includes a comparison with another approach. Finally, Section 5 concludes the paper and outlines future work.
2 RELATED WORK
In the literature, there are many approaches for aggregating keywords. We cluster them into four categories: the first is based on linguistic knowledge; the second on the use of external knowledge; the third on graphs; and the last on statistical methods.
The approaches based on linguistic knowledge consider a corpus as a set of the vocabulary mentioned in the documents, but the results in this case are sometimes ambiguous. To overcome this obstacle, a few techniques based on lexical and syntactic knowledge have been introduced. In (Poudat et al., 2006) and (Kohomban and Lee, 2007), the authors proposed a classification of textual documents based on scientific lexical variables of the discourse. Among these lexical variables, they chose nouns rather than adverbs, verbs, or adjectives, because nouns are more likely to convey scientific concepts.
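These works do not prescribe a particular tool; a minimal sketch of this noun-selection step, assuming NLTK and its standard tagger data as a stand-in toolkit, could be:

import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' resources are installed

def noun_candidates(text):
    # Tag the tokens and keep only nouns (Penn Treebank tags starting with 'NN')
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag.startswith('NN')]

print(noun_candidates("The aggregation function clusters textual measures."))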
The approaches based on the use of external knowledge select certain keywords that represent a domain. These approaches often rely on knowledge sources such as an ontology. The authors in (Ravat et al., 2007) proposed an aggregation function that takes a set of keywords as input and outputs another set of aggregated keywords. They assumed that both the ontology and the corpus of documents belong to the same domain. The authors in (Oukid et al., 2013) proposed an aggregation operator, Orank (OLAP rank), that aggregates a set of documents by ranking them in descending order, using a vector space representation (a minimal ranking sketch is given after this paragraph). In (Subhabrata and Sachindra, 2014), the authors developed a textual aggregation model using an ontology, building a keyword ontology tree.
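As an illustration of this kind of vector-space ranking (the exact weighting scheme of (Oukid et al., 2013) is not reproduced here, and the vectors below are made up), a sketch ranking documents by cosine similarity to a query vector could be:

import math

def cosine(u, v):
    # Cosine similarity between two term-weight vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(doc_vectors, query_vector):
    # Return document identifiers sorted by descending similarity
    scores = {d: cosine(vec, query_vector) for d, vec in doc_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Made-up term weights for three documents and a query
docs = {'d1': [1, 0, 2], 'd2': [0, 1, 1], 'd3': [2, 1, 0]}
print(rank_documents(docs, [1, 0, 1]))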
The approaches based on graphs use keywords to construct a keyword graph. The nodes represent keywords obtained after pre-processing, candidate selection, and edge representation. After the graph construction step, different types of keyword-ranking approaches have been applied. The first such approach, proposed in (Mihalcea and Tarau, 2004), is called TextRank; graph nodes are the keywords, and edges represent the co-occurrence relations between them. The idea is that if a keyword is linked to by a large number of other keywords, then that keyword is considered important.
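A minimal sketch of this idea follows, as a simplified variant with unweighted document-level co-occurrence edges, whereas TextRank itself uses a co-occurrence window and weighted scores as described in (Mihalcea and Tarau, 2004):

from collections import defaultdict
from itertools import combinations

def keyword_graph(documents):
    # Undirected co-occurrence graph: keywords in the same document are linked
    neighbors = defaultdict(set)
    for doc in documents:
        for a, b in combinations(set(doc), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors

def rank_keywords(neighbors, damping=0.85, iterations=30):
    # PageRank-style iteration over the keyword graph
    score = {k: 1.0 for k in neighbors}
    for _ in range(iterations):
        score = {k: (1 - damping)
                    + damping * sum(score[m] / len(neighbors[m]) for m in neighbors[k])
                 for k in neighbors}
    return sorted(score, key=score.get, reverse=True)

# Made-up documents, each given as its list of keywords
docs = [['olap', 'cube', 'measure'], ['olap', 'keyword'], ['cube', 'measure']]
print(rank_keywords(keyword_graph(docs)))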
The approaches based on statistical methods use the occurrence frequencies of terms and the correlations between terms. In (Kimball, 2003), the author proposed the LSA (Latent Semantic Analysis) method, in which the corpus is represented by a matrix whose rows represent the documents and whose columns represent the keywords. An element of the matrix gives the number of occurrences of a keyword in a document. After decomposition and reduction, this method provides a set of keywords that represent the corpus.
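As a brief illustration of the decomposition-and-reduction step (with a made-up occurrence matrix; LSA implementations vary in how representative keywords are extracted), a sketch in Python could be:

import numpy as np

# Made-up document-keyword matrix: rows are documents, columns are keywords
keywords = ['olap', 'cube', 'text', 'keyword']
counts = np.array([[3, 2, 0, 1],
                   [2, 3, 1, 0],
                   [0, 1, 4, 3],
                   [1, 0, 3, 4]])

# Decompose the matrix and keep the k strongest latent dimensions
u, s, vt = np.linalg.svd(counts, full_matrices=False)
k = 2
loadings = np.abs(vt[:k])  # how strongly each keyword loads on each latent dimension

# Retain, for each latent dimension, the keyword that loads on it most strongly
representatives = {keywords[int(loadings[i].argmax())] for i in range(k)}
print(representatives)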
The authors of (Hady et al., 2007) proposed an approach called TUBE (Text-cUBE) to
discover the associations among entities. The cells of
the cube contain keywords, and they attach to each
keyword an interestingness value. The authors in (Bringay et al., 2010) proposed an aggregation function based on a new adaptive measure of tf-idf, which takes into account