VISUALIZATION OF DOCUMENT CLUSTERS

An Interactive Visual Tool to Browse Textual Documents

Faryel Allouti, Mohamed Nadif

LIPADE, UFR MI, Paris Descartes University, 45, rue des Saints-Pères, 75270, Paris, France

Benoît Otjacques

Public Research Center - Gabriel Lippmann, Informatics, Systems and Collaboration Department

41, Rue du Brill, L-4422 Belvaux, Luxembourg

Keywords:

Information visualization, Clustering, Documents.

Abstract:

Handling collections of text documents has become a daily task for many professionals whatever their economic sector or position in the organization. In many cases, little metadata is added to the documents, which makes it difficult to automatically derive a semantic structure within the collection. This paper describes a new tool that combines the clustering and the visualization paradigms to help a user identify similar documents in an unstructured collection. Several clustering algorithms can be used to identify clusters of documents that are subsequently displayed on a plane. In this work, we use the Classification EM algorithm. The originality of our approach is to allow the user to refine the clustering process interactively by means of a visual analysis of the results of the intermediate steps. In addition, the tool also shows some enriched views of the content of documents and allows the user to include a semantic analysis based on personal knowledge to the computer-based clustering process.

1 INTRODUCTION

Considering the exponentially increasing amount of digital data, the automatic processing of collections of documents has become a major issue in both academic and business contexts. This trend can be illustrated by several examples. The massive indexing of web pages that underpins the success of Google is perhaps the most famous one of the last few years.

At the individual level, many of us regularly spend time searching for a specific document in our file systems. We regularly browse our storage disks to find the numerous versions of a report and identify the right one. As few of us rigorously build and maintain semantically relevant and cross-referenced storage structures (allowing mapping among different systems), search operations usually take too much time. In many cases, other tools (e.g. e-mail) are diverted to become secondary file systems that also need to be explored. Unfortunately, these systems do not share common access points to the data (e.g. indexes, keywords). This issue, known as the Fragmentation Problem (cf. Gonçalves and Jorge (2008)), raises additional challenges regarding document retrieval.

Thus, full-text search (potentially in several distinct systems) is often the only available solution to find the needed file(s). Unfortunately, finding the most discriminating words to identify the searched file(s) may be challenging. If several versions of the same file are present in the collection, this process may become even more difficult. In some cases, the files are tagged with metadata and the search process can rely on this complementary information. Unfortunately again, metadata are often not sufficient to describe the content of a document, and summarizing a text with a couple of relevant words is not that easy. Moreover, many documents do not have any semantic metadata at all. In addition, Blanc-Brude and Scapin (2007) have pointed out that users only partially recall the keywords describing their own documents.

This paper focuses on this type of problem: finding similar documents in a set and subsequently analyzing in detail the elements of this subset to identify a specific item. We propose to investigate this issue with a mixed approach based on the respective strengths and weaknesses of humans and computers. Computers are efficient for automated, well-defined tasks but they cannot easily tackle the informal knowledge that users have concerning the whole context of a document. Conversely, humans continuously process pieces of information involving uncertainty, tacit knowledge and/or informal information. On the basis of this observation, we suggest combining computer-based clustering steps with complementary steps that rely on the visual intelligence of humans. The partial results of the computer-based clustering are refined by the user in an iterative loop. Designing and experimenting with such a tool is our overall purpose.

Allouti F., Nadif M. and Otjacques B. (2010).
VISUALIZATION OF DOCUMENT CLUSTERS - An Interactive Visual Tool to Browse Textual Documents.
In Proceedings of the International Conference on Imaging Theory and Applications and International Conference on Information Visualization Theory and Applications, pages 157-160
DOI: 10.5220/0002838501570160
SciTePress

2 THE STATE OF THE ART

Various computer-based approaches have been proposed in order to tackle the vast amount of digital documents that are accessible to a user in a given context. Paulovich and Minghim (2008) have proposed a technique called HiPP, a hierarchical point placement strategy for displaying, interacting with, and organizing large multidimensional data sets. Barreau (1995) identifies four aspects of Personal Information Management: acquisition, organization, maintenance and retrieval. Our study focuses on the retrieval phase. Indeed, Blanc-Brude and Scapin (2007) have recently highlighted the need for new tools to retrieve personal documents. They explain that the presentation of an item's category is likely to increase the recall of that item and that it would be interesting to find a methodology that would elicit a greater proportion of spontaneous recall. We especially study how to group similar documents into classes and how to find one or several specific items within these classes.

Gonçalves and Jorge (2008) propose to retrieve personal information not only on the basis of arbitrary classifications in hierarchies but also by referring to a wider range of contextual data. We follow these recommendations by including successive classification and visualization steps together with a full-text search feature.

3 OUR APPROACH

Our initial purpose is to develop a tool that organizes documents into classes and then visualizes these classes in order to assist the user in analysing and finding files, or information within files. We limit our research to the textual content of files, excluding other elements (e.g. pictures). However, we acknowledge the need to include other types of content in the future. We elaborated a global strategy that consists of three main parts: data processing, clustering, and visualization of the results with a projection algorithm. At each step, we offer the user appropriate views depending on the ongoing task: a visual representation of the structure of the dataset (i.e. contingency table, dissimilarity matrix), the results of the clustering step, or the results of the projection algorithm (see Figure 1).

Figure 1: Global strategy.

3.1 Data Processing

In this step we handle the selection of documents. We assume that the documents belong to predefined classes. A pre-processing step is necessary to obtain data structured in the form of a contingency table (files × words), where each cell corresponds to the frequency of a word in a file.
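As a minimal sketch of this pre-processing step, the following Python code builds a (files × words) contingency table from raw text. The function name `contingency_table` and the tokenizer (lowercased alphabetic runs) are illustrative assumptions; the paper does not specify how words are extracted or filtered.

```python
from collections import Counter
import re

def contingency_table(docs):
    """Return (vocabulary, table) where table[i][j] counts word j in document i."""
    # Assumed tokenizer: lowercase runs of ASCII letters (not specified in the paper).
    counts = [Counter(re.findall(r"[a-z]+", text.lower())) for text in docs]
    vocabulary = sorted(set().union(*counts))
    table = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, table

docs = ["the wing lift of the wing", "library index search", "lift and search"]
vocabulary, table = contingency_table(docs)
for row in table:
    print(row)
```

Each row of `table` is one file and each column one word of the sorted vocabulary.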

The data can also be presented in the form of a dissimilarity matrix (files × files), where the value of each cell C_ij corresponds to the distance between file i and file j. In our work, the dissimilarity matrix is computed using the chi-square distance. This choice is justified for several reasons, in particular by the property of distributional equivalence, which implies stable results when agglomerating elements with similar profiles.
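The dissimilarity matrix can be sketched as follows, assuming the usual definition of the chi-square distance between row profiles: the squared differences between profiles are weighted by the inverse of the column masses. The function name `chi2_distance_matrix` is hypothetical, and the code assumes that every file contains at least one word.

```python
import math

def chi2_distance_matrix(table):
    """Pairwise chi-square distances between the row profiles of a count table.

    Assumes every row (file) contains at least one word.
    """
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    # Keep only words that occur at least once, to avoid dividing by zero.
    cols = [j for j, s in enumerate(col_sums) if s > 0]
    # Row profile of file i: n_ij / n_i.   Column mass of word j: n_.j / n
    profiles = [[row[j] / rs for j in cols] for row, rs in zip(table, row_sums)]
    masses = [col_sums[j] / total for j in cols]
    size = len(table)
    dist = [[0.0] * size for _ in range(size)]
    for a in range(size):
        for b in range(a + 1, size):
            d2 = sum((pa - pb) ** 2 / m
                     for pa, pb, m in zip(profiles[a], profiles[b], masses))
            dist[a][b] = dist[b][a] = math.sqrt(d2)
    return dist

dist = chi2_distance_matrix([[2, 0, 2], [1, 0, 1], [0, 3, 1]])
```

In this example the first two files have proportional counts, hence identical profiles, so their distance is zero even though their lengths differ.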

3.2 Clustering Step

As previously mentioned, we focus on textual data. We have adopted a mixture model framework. In this context, we propose to use multinomial mixture models (Govaert and Nadif, 2007) to cluster the set of files. Setting this model within the classification maximum likelihood (CML) approach, the authors showed that, when the proportions of the clusters are assumed equal, it is approximately associated with the within-cluster criterion based on the chi-square distance. Therefore, we use the same distance in the clustering step and in the visual projection. In the general

IVAPP 2010 - International Conference on Information Visualization Theory and Applications


context, the criterion optimized is the complete data

log-likelihood.
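To make the clustering step more concrete, here is a minimal sketch of a Classification EM loop for a multinomial mixture. It alternates between estimating the parameters from the current partition and reassigning each file to its most probable cluster, which increases the complete data log-likelihood. The function name `cem_multinomial`, the small smoothing constant `eps`, the stopping rule and the use of a user-supplied initial partition are assumptions made for illustration; they are not taken from the implementation described in this paper.

```python
import math

def cem_multinomial(table, labels, n_clusters, n_iter=20, eps=1e-9):
    """Classification EM for a multinomial mixture, started from `labels`."""
    n_words = len(table[0])
    for _ in range(n_iter):
        # M-step: cluster proportions and word probabilities from the partition.
        sizes = [0] * n_clusters
        word_counts = [[eps] * n_words for _ in range(n_clusters)]  # eps avoids log(0)
        for row, k in zip(table, labels):
            sizes[k] += 1
            for j, x in enumerate(row):
                word_counts[k][j] += x
        log_pi = [math.log(max(s, eps) / len(table)) for s in sizes]
        log_alpha = []
        for counts in word_counts:
            total = sum(counts)
            log_alpha.append([math.log(c / total) for c in counts])
        # E-step and C-step: assign each file to its most probable cluster.
        new_labels = []
        for row in table:
            scores = [log_pi[k] + sum(x * la for x, la in zip(row, log_alpha[k]))
                      for k in range(n_clusters)]
            new_labels.append(max(range(n_clusters), key=scores.__getitem__))
        if new_labels == labels:  # partition is stable: stop
            break
        labels = new_labels
    return labels

table = [[5, 4, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 1, 6, 4], [5, 5, 0, 0]]
print(cem_multinomial(table, [0, 0, 1, 1, 1], n_clusters=2))
```

In this toy example the last file starts in the wrong cluster and is moved to the cluster of the first two files, whose word profile it shares.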

3.3 Visualization

The visualization of the obtained clusters is performed by multidimensional scaling (MDS), using the Difference of Convex functions Algorithm (DCA) to optimize the classical stress criterion of MDS (Le Thi and Pham Dinh, 2001). In the following, we call this algorithm MDSDCA.
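For reference, the quantity being minimized can be sketched as below. This code only evaluates a raw stress value for a given layout; it does not implement DCA. Unit weights for all pairs and the function name `raw_stress` are assumptions, since the exact form of the criterion is not given in this paper.

```python
import math

def raw_stress(coords, dissim):
    """Sum over pairs of (distance in the layout - target dissimilarity) squared."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            total += (math.dist(coords[i], coords[j]) - dissim[i][j]) ** 2
    return total

coords = [(0.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
dissim = [[0, 5, 4], [5, 0, 3], [4, 3, 0]]
print(raw_stress(coords, dissim))
```

A layout whose pairwise distances reproduce the dissimilarities exactly, as in this example, has zero stress.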

MDSDCA is an iterative algorithm. To avoid a random initialization, we proposed in Allouti et al. (2009) an initialization approach based on the partition obtained by the CEM algorithm, the dissimilarity matrix and bearing. The proposed initialization avoids performing a large number of unnecessary iterations.

MDSDCA determines the coordinates of the different documents on a plane. Based on these coordinates, we propose two graphical presentations. The first one consists in positioning the documents on a plane using Cartesian coordinates. The second one consists in positioning the documents on a plane using polar coordinates.
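One possible way to derive a polar presentation from the planar coordinates is sketched below. The choice of the centroid as the origin and the linear rescaling of the angle into a sector (full circle, half circle or quarter circle) are assumptions made for illustration; the paper does not state how the polar layout is computed, and the function name `to_polar` is hypothetical.

```python
import math

def to_polar(coords, sector=2 * math.pi):
    """Convert planar points to (radius, angle), squeezing angles into `sector`."""
    cx = sum(x for x, _ in coords) / len(coords)
    cy = sum(y for _, y in coords) / len(coords)
    polar = []
    for x, y in coords:
        radius = math.hypot(x - cx, y - cy)
        angle = math.atan2(y - cy, x - cx) % (2 * math.pi)  # in [0, 2*pi)
        polar.append((radius, angle * sector / (2 * math.pi)))
    return polar

points = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
half_circle = to_polar(points, sector=math.pi)
```

With `sector=math.pi` the four points, which originally lie a quarter turn apart, are placed one eighth of a turn apart on a half circle.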

4 VISUALIZATION TOOL

Our tool provides two graphical interfaces. The first is intended for an end user who does not know the classes. The second is intended for an expert (the researcher) who already has an idea of the classes and wants to analyze them closely. Figure 2 illustrates the second interface, which is composed of six parts. Part-1 allows the user to choose the clustering algorithm. Part-2 concerns the polar coordinates and allows the user to position the documents in a circle, a half circle or a quarter circle. Part-3 allows the user to visualize the contingency table and the dissimilarity matrix, to search for documents containing a given word, and to cluster and project a selected subset. Part-4 shows the results of the projection algorithm. Part-5 positions the documents around a document selected in part-4 while respecting the distances between the selected document and the other documents. Part-6 displays the content of the selected document.

We illustrate our tool on a sample of Classic3 (ftp://ftp.cs.cornell.edu/pub/smart/). Classic3 is a database composed of 3893 documents obtained by combining the abstracts of CISI, CRANFIELD and MEDLINE.

Figure 2: Interface for an expert.

4.1 Visualization of the Structure of Data

Visualizing the structure of the data gives an overview of the data. With the contingency table, the expert can see which words are shared by documents and determine whether the set of documents shares many words. The shade of color also indicates the frequency of each word in each document: the darker the color, the higher the frequency. Visualizing the dissimilarity matrix makes it possible to determine which files are close to each other in terms of the chi-square distance.

4.2 Visualization of the Result of the Projection Algorithm

Our tool provides an overview of the data, filtering, and details on demand. That is, we first display the results of the projection algorithm (part-4); if the expert wants details on a file, he can select it and view its content (part-6). He can also visualize the position of the other files relative to the selected file (part-5). In this case, he may choose to view all files, only the files from its Out class, or only the files from its In class.

We have also added visual cues in the text view in order to help the user identify the most relevant words in the document as well as in the whole class. For example, brown marks the stopwords, green marks the words shared by the selected file and the files belonging to its Out class, and black marks the words belonging only to the selected file.
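The color coding described above could be expressed as a simple lookup, as in the following sketch. The function name `word_colors`, its parameters and the precedence of the rules (stopwords are tested first) are assumptions; only the three colors and their meaning come from the text.

```python
def word_colors(selected_words, class_words, stopwords):
    """Map each word of the selected file to a display color."""
    colors = {}
    for word in selected_words:
        if word in stopwords:
            colors[word] = "brown"   # stopword
        elif word in class_words:
            colors[word] = "green"   # shared with files of the same Out class
        else:
            colors[word] = "black"   # found only in the selected file
    return colors

colors = word_colors(
    selected_words=["the", "wing", "flutter"],
    class_words={"wing", "lift", "the"},
    stopwords={"the", "of"},
)
print(colors)
```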

We also added to our tool the possibility of visualizing the result of the projection algorithm dynamically. The dynamic visualization allows the user to select a file and view in real time the files that are closest to it in terms of either the Euclidean distance or the chi-square distance. Thus, the user can analyze these files in more detail (see Figure 3).
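The selection of the closest files can be sketched as a lookup in a precomputed distance matrix, which may hold either Euclidean distances between projected points or chi-square distances between files. The function name `nearest_files` and the parameter `k` (number of neighbours shown) are hypothetical.

```python
def nearest_files(selected, dist, k=3):
    """Indices of the k files closest to `selected` in the distance matrix `dist`."""
    others = [j for j in range(len(dist)) if j != selected]
    return sorted(others, key=lambda j: dist[selected][j])[:k]

dist = [
    [0, 2, 5, 1],
    [2, 0, 3, 4],
    [5, 3, 0, 6],
    [1, 4, 6, 0],
]
print(nearest_files(0, dist, k=2))
```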

One can also select a subset to be clustered and projected. It is also possible to select another subset from the previous one. Thus, the result of the computer-based clustering is refined by the user in an iterative loop (see Figure 3).

4.3 Visualization of the Pre-post-clustering Relationship

Visualization of the pre-post-clustering relationship distinguishes four cases: files that were in the same In class and are in the same Out class, files that were not in the same In class and are not in the same Out class, files that were in the same In class but are not in the same Out class, and files that were not in the same In class but are in the same Out class. Specifically, this part makes it possible to visualize wrongly classified files.
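The four cases can be counted over all pairs of files, as in the sketch below. It reads the In classes as the predefined classes and the Out classes as the clusters returned by the algorithm, which is our interpretation of the terms used here; the function name `pair_agreement` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_agreement(in_labels, out_labels):
    """Count file pairs in each of the four In/Out co-membership cases."""
    cases = Counter()
    for i, j in combinations(range(len(in_labels)), 2):
        same_in = in_labels[i] == in_labels[j]
        same_out = out_labels[i] == out_labels[j]
        cases[(same_in, same_out)] += 1
    return cases

cases = pair_agreement(in_labels=[0, 0, 1, 1], out_labels=[0, 0, 0, 1])
# Keys are (same In class, same Out class); mixed keys flag disagreements.
print(cases)
```

Pairs counted under `(True, False)` or `(False, True)` are those on which the clustering disagrees with the predefined classes.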

4.4 Information Retrieval

Concerning information retrieval, our tool makes it possible to search for documents by specifying a word. To illustrate the result of the search, we assign a color to the files containing the specified word, and we indicate to the user the number of files containing it (see Figure 2). The user can choose to visualize only the files containing the specified word and thus analyse them in more detail. By clicking on one of these files, one can, for example, visualize the occurrences of the specified word in the selected file.
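A minimal sketch of this word search is given below. Whole-word, case-insensitive matching is an assumption, as is the function name `search_word`; the paper does not describe the matching rules used by the tool.

```python
import re

def search_word(docs, word):
    """Return {document index: number of occurrences} for documents containing `word`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = {}
    for i, text in enumerate(docs):
        n = len(pattern.findall(text))
        if n:
            hits[i] = n
    return hits

docs = ["Wing lift and wing drag", "library search", "The wings"]
hits = search_word(docs, "wing")
print(len(hits), "file(s) contain the word:", hits)
```

The number of keys gives the number of files to highlight, and each value gives the number of occurrences to show when a file is selected.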

5 CONCLUSIONS

In this paper we presented an interactive tool that combines clustering and visualization methods for textual data. The tool helps to identify similar documents in an unstructured collection. Specifically, we used the multinomial mixture model to cluster the documents, and MDSDCA for visualization. The originality of our approach is to allow the user to interactively refine the clustering process based on a visual analysis of the results of the intermediate steps. In addition, our tool offers visual cues in the text view in order to help the user identify the most relevant words in the document as well as in the whole class. The tool also shows enriched views of the content of documents and allows the user to add a semantic analysis based on personal knowledge to the computer-based clustering process.

Figure 3: Clustering and projection of a subset of documents.

In future work, we will illustrate our tool on other real data.

ACKNOWLEDGEMENTS

The research work reported in this paper has been

supported by a grant of the National Research Fund

(FNR) of Luxembourg. For this, we thank the FNR.

REFERENCES

Allouti, F., Nadif, M., Le Thi, H. A., Otjacques, B., 2009. Mixture model and MDSDCA for textual data. In Proceedings of the 6th International Conference on Cooperative Design, Visualization and Engineering (CDVE 2009), 20-23 September 2009, Luxembourg, published in Cooperative Design, Visualization, and Engineering, Lecture Notes in Computer Science, vol. 5738, Springer, Berlin, Germany, pp. 240-244.

Blanc-Brude, T., Scapin, D., 2007. What do People Recall about their Documents? Implications for Desktop Search Tools. In IUI'07, 2007 International Conference on Intelligent User Interfaces, ACM Press.

Barreau, D., 1995. Context as a factor in personal information management systems. In Journal of the American Society for Information Science, 46(5), 327-339.

Gonçalves, D., Jorge, J. A., 2008. In Search of Personal Information: Narrative-Based Interfaces. In Proceedings of the 13th International Conference on Intelligent User Interfaces (Gran Canaria, Spain, January 13-16, 2008). IUI'08. ACM, New York, NY, 179-188.

Govaert, G., Nadif, M., 2007. Clustering of contingency table and mixture model. European Journal of Operational Research. 36, 1055-1066.

Le Thi, H. A., Pham Dinh, T., 2001. D.C. Programming Approach for Solving the Multidimensional Scaling Problem. Nonconvex Optimization and Its Applications, Kluwer Academic Publishers. 231-276.

Paulovich, F. V., Minghim, R., 2008. HiPP: A novel hierarchical point placement strategy and its application to the exploration of document collections. IEEE Transactions on Visualization and Computer Graphics 14, 6 (2008), 1229-1236.
