ods; the second one is based on linguistic knowledge; the third one is based on graphs; and the last one uses external knowledge.
The approaches based on statistical methods use
the occurrence frequencies of terms and the correla-
tion between terms to extract the keywords. Hady et al. (Hady et al., 2007) proposed an approach called TUBE (Text-cUBE). They adapted a relational database to textual data based on the cube design: each cell contains keywords, and an interestingness value is attached to each keyword.
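As a rough illustration of this cube design, a cell addressed by dimension coordinates can map keywords to interestingness scores; the following is a minimal sketch in which the coordinates, keywords, scores and the max-based roll-up are illustrative assumptions, not the authors' actual schema.

# Sketch of a TUBE-like text-cube cell: each cell, addressed by dimension
# coordinates, stores keywords with an interestingness value.
# Coordinates and scores below are purely illustrative.
cube = {
    ("2007", "sports"): {"match": 0.82, "league": 0.61, "coach": 0.35},
    ("2007", "finance"): {"market": 0.74, "stock": 0.66},
}

def roll_up(cells):
    """Merge several cells, keeping for each keyword its maximum
    interestingness value (one possible aggregation choice)."""
    merged = {}
    for cell in cells:
        for kw, score in cell.items():
            merged[kw] = max(score, merged.get(kw, 0.0))
    return merged

print(roll_up([cube[("2007", "sports")], cube[("2007", "finance")]]))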
Zhang et al. (Zhang et al., 2009) proposed an approach called Topic Cube. The main idea of a Topic Cube is to use a hierarchical topic tree as the hierarchy of the text dimension. This structure allows users to drill-down and roll-up along the tree and to explore the content of the text documents at different granularities and levels of topics in the cube. The first level of the tree contains the detailed topics, the second level contains more general topics, and the last level contains the aggregation of all topics. A textual measure is needed to aggregate the textual data, and the authors proposed two: word distribution and topic coverage. The topic coverage computes the probability that a document covers a topic; aggregating this coverage over the corpus tells users which topic is dominant in a set of documents.
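A minimal sketch of this coverage aggregation, assuming per-document topic probabilities already produced by some topic model (the topic names and values below are made up), could look as follows.

# Sketch: aggregate topic coverage over a set of documents by averaging
# each document's probability of containing the topic.
doc_topic_prob = [
    {"sport": 0.7, "economy": 0.2},
    {"sport": 0.1, "economy": 0.8},
    {"sport": 0.6, "economy": 0.3},
]

def coverage(docs, topic):
    return sum(d.get(topic, 0.0) for d in docs) / len(docs)

dominant = max({"sport", "economy"}, key=lambda t: coverage(doc_topic_prob, t))
print(dominant, coverage(doc_topic_prob, dominant))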
Ravat et al. (Ravat et al., 2008) proposed an aggregation function called TOP-Keywords to aggregate keywords extracted from documents. They used the tf-idf measure and then selected the k most frequent terms.
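A minimal sketch of such a tf-idf-based selection, assuming a toy tokenized corpus and a plain tf-idf weighting (not necessarily the exact variant used by the authors), is:

import math
from collections import Counter

def top_keywords(docs, k):
    """Rank terms by the sum of their tf-idf weights over the corpus
    and keep the k best ones (plain tf-idf, for illustration)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    scores = Counter()
    for d in docs:
        tf = Counter(d)
        for t, f in tf.items():
            scores[t] += (f / len(d)) * math.log(n / df[t])
    return [t for t, _ in scores.most_common(k)]

docs = [["olap", "cube", "text", "cube"], ["text", "keyword", "cube"], ["keyword", "graph"]]
print(top_keywords(docs, 3))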
Bringay et al. (Bringay et al., 2011) proposed an aggregation function based on a new adaptive tf-idf measure that takes into account the hierarchies associated with the dimensions.
Wartena and Brussee (Wartena and Brussee, 2008) proposed another method, which we call TOPIC, that uses the k-bisecting clustering algorithm based on the Jensen-Shannon divergence between probability distributions, as described in (Archetti and Campanelli, 2006). Their method starts by selecting two elements as seeds of the first two clusters; the remaining terms are assigned to the cluster of the closer of the two selected elements. Once all the terms are assigned, the process is repeated for each cluster whose diameter is larger than a specified threshold value.
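The divergence driving this clustering can be sketched as follows; this is a standard Jensen-Shannon implementation over two term-probability distributions, not the authors' full k-bisecting procedure.

import math

def kl(p, q):
    # Kullback-Leibler divergence (base 2), skipping zero-probability terms
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two probability distributions,
    used as the distance in the k-bisecting clustering."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jensen_shannon([0.5, 0.5, 0.0], [0.1, 0.4, 0.5]))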
Bouakkaz et al. (Bouakkaz et al., 2015) proposed a textual aggregation based on keywords. When a user wants to obtain a more aggregated view of the data, he performs a roll-up operation, which needs an adapted aggregation function. Their approach, entitled GOTA, is composed of four main steps: (1) extraction of keywords with their frequencies; (2) construction of the distance matrix between words using the Google similarity distance; (3) application of the k-means algorithm to distribute keywords according to their distances; and (4) selection of the k aggregated keywords.
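Step (2) can be sketched with the normalized Google distance of Cilibrasi and Vitanyi, computed from page-hit counts; the counts below are illustrative, since obtaining real ones requires querying a search engine, and this is not necessarily the exact variant used in GOTA.

import math

def ngd(fx, fy, fxy, n):
    """Normalized Google distance between two terms, given the number of
    pages containing x (fx), y (fy), both (fxy), and the total number of
    indexed pages n."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# made-up hit counts for two terms
print(ngd(fx=46_700_000, fy=12_200_000, fxy=2_630_000, n=8_000_000_000))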
The approaches based on linguistic knowledge consider a corpus as the set of vocabulary mentioned in the documents, but the results in this case are sometimes ambiguous. To overcome this obstacle, techniques based on lexical and syntactic knowledge have been introduced.
In (Poudat et al., 2006; Kohomban and Lee, 2007)
the authors described a classification of textual doc-
uments based on scientific lexical variables of dis-
course. Among these lexical variables, they chose nouns because they are more likely than adverbs, verbs or adjectives to emphasize scientific concepts.
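A minimal sketch of such a noun-based candidate selection, assuming NLTK's default English tokenizer and part-of-speech tagger are available (any tagger would do; this is not the cited authors' pipeline), is:

import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data

def noun_candidates(text):
    """Keep only nouns (tags NN, NNS, NNP, NNPS) as keyword candidates."""
    tokens = nltk.word_tokenize(text)
    return [w for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]

print(noun_candidates("The aggregation operator ranks scientific documents."))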
The approaches based on the use of external knowledge select certain keywords that represent a domain. These approaches often rely on knowledge models such as ontologies. Ravat et al. proposed another aggregation function that takes as input a set of keywords extracted from the documents of a corpus and outputs another set of aggregated keywords (Ravat et al., 2007). They assumed that both the ontology and the corpus of documents belong to the same domain. Oukid et al. proposed an aggregation operator, Orank (OLAP rank), that aggregates a set of documents by ranking them in descending order using a vector space representation (Oukid et al., 2013).
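Such a ranking can be sketched, under assumptions, as ordering documents by decreasing cosine similarity between their vector representations and a query (or cell) vector; the vectors below are illustrative, and the exact Orank scoring is defined in the cited paper.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_documents(doc_vectors, query_vector):
    """Return document ids sorted by decreasing cosine similarity to the query."""
    return sorted(doc_vectors, key=lambda d: cosine(doc_vectors[d], query_vector), reverse=True)

docs = {"d1": [1, 0, 2], "d2": [0, 1, 1], "d3": [2, 1, 0]}
print(rank_documents(docs, [1, 0, 1]))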
The approaches based on graphs use keywords to construct a keyword graph: the nodes represent the keywords obtained after pre-processing and candidate selection, and the edges represent relations between them. After this graph representation step, different types of keyword ranking approaches have been applied. The first approach, proposed in (Mihalcea and Tarau, 2004), is called TextRank, where graph nodes are the keywords and edges represent the co-occurrence relations between keywords. The idea is that if a keyword is linked to a large number of other keywords, this keyword is considered important.
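A minimal sketch of this ranking, using a plain PageRank-style power iteration over an undirected co-occurrence graph (a simplification of TextRank; the edge list below is illustrative), is:

def textrank(edges, damping=0.85, iterations=50):
    """Rank nodes of an undirected co-occurrence graph with a PageRank-style
    power iteration; edges is a list of (keyword, keyword) pairs."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    score = {n: 1.0 for n in neighbors}
    for _ in range(iterations):
        new = {}
        for n in neighbors:
            rank = sum(score[m] / len(neighbors[m]) for m in neighbors[n])
            new[n] = (1 - damping) + damping * rank
        score = new
    return sorted(score, key=score.get, reverse=True)

edges = [("olap", "cube"), ("cube", "text"), ("text", "keyword"), ("keyword", "olap")]
print(textrank(edges))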
Bouakkaz et al. (Bouakkaz et al., 2014) proposed a new method which performs aggregation of document keywords based on graph theory. This function produces the main aggregated keywords out of a set of terms representing a corpus. Their aggregation approach, called TAG (Textual Aggregation by Graph), aims at extracting from a set of terms the most representative keywords for the corpus of textual documents using a graph. The function takes as input the set of all terms extracted from a corpus and outputs an ordered set containing the aggregated keywords. The process of aggregation goes through three steps: (1) Extrac-