A Study on the Role of Similarity Measures in Visual Text Analytics
F. San Roman S.¹, R. D. de Pinho², R. Minghim¹ and M. C. F. de Oliveira¹
¹ Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, Brazil
² Ministério da Ciência, Tecnologia e Inovação, Brasília, Brazil
Keywords:
Visual Text Analytics, Visual Text Mining, Vector Space Model, High-dimensional Data Visualization and
Multidimensional Projections.
Abstract:
Text Analytics is essential for a large number of applications and good approaches to obtain visual mappings
of text are paramount. Many visualization techniques, such as similarity based point placement layouts, have
proved useful to support visual analysis of documents. However, they are sensitive to data quality, which, in
turn, relies on a critical preprocessing step that involves text cleaning and, in some cases, term detection and
weighting, as well as the definition of a similarity function. Not much has been discussed about the effect of
these similarity calculations on the quality of visual representations. This paper presents a study on
the role of different text similarity measurements on the generation of visual text mappings. We focus mainly
on two types of distance functions, those based on the well-known text vector representation and on direct
string comparison measurements, comparing their effect on visual mappings obtained with point placement
techniques. We find that both have their value but, in many circumstances, the vector space model (VSM)
is the best solution when discrimination is important. However, the VSM is not incremental, that is, new
additions to a collection force a recalculation of the whole feature space and similarities. In this work we also
propose a new incremental model based on the VSM, which is shown to present the best visualization results
in many configurations tested. We show the evaluation results and offer recommendations on the application
of different text similarity measurements for Visual Text Analytics tasks.
1 INTRODUCTION
Producing visualizations from textual documents re-
quires a pre-processing step in which similarity evalu-
ation plays a fundamental role. Often, a Vector Space
Model (VSM) (Salton et al., 1975) that considers the
frequency of relevant words is created, over which co-
sine distance approximates text dissimilarity. Little is
known about how this pre-processing affects the out-
come of text visualization techniques.
The VSM poses many limitations for visualiza-
tion purposes, as it fails to capture semantics im-
plicit in the relationships among words and terms.
Moreover, in building a meaningful VSM several pre-
processing operations require parameter settings that
may affect the outcome considerably. Resulting mod-
els are typically described by very high-dimensional
feature spaces, which suffer from drawbacks globally
referred to as ‘the curse of dimensionality’ (Huang
et al., 2005) that result in low discrimination power
by most techniques.
VSM models may be avoided altogether by us-
ing direct string comparison functions (Telles et al.,
2007). Adding documents to a collection does not
impact the underlying model, since it suffices to com-
pare the new document with the existing ones. Many
such measures have been defined, for different pur-
poses and applications. Again, there is little record on
how their choice affects text analytics, visual or other-
wise, and the question remains on how they compare
with cosine distances calculated over the VSM.
We are concerned with assessing how the choice
of a (dis)similarity function affects the output of
content-based visualization techniques. We consider
visualizations that lay out documents as points on
a plane based on their similarity, to verify how the
choice of a similarity function affects their quality
in terms of discriminating groups of text files with
highly related content. We also address the additional
limitation that computing a VSM requires the com-
plete collection to be available a priori, rendering
it incapable of handling streaming text. This paper
investigates these issues, reporting on the following
questions:
1. are string distance measures suitable for text
visualizations based on similarity? which
measures may be considered and how does their
choice affect the visualizations?
2. how do string distances compare with the tra-
ditional cosine distance computed over the
VSM regarding visualization quality?
3. is it possible to represent a dynamic collec-
tion, updating a vector model as documents
are added? how do visualizations built from such
a model compare to those obtained with the
conventional VSM and with string distances?
2 RELATED WORK
The VSM with tf-idf weighting of terms deemed relevant
is the typical input representation for most text
visualization and text clustering techniques. Visu-
alizations may be derived directly from such repre-
sentations, e.g. as in various Multidimensional Scal-
ing (MDS) approaches (Wise et al., 1995; Paulovich
et al., 2008; Paulovich and Minghim, 2008). Hier-
archical similarity-based layouts have also been pro-
posed and illustrated for visualizing textual docu-
ments, e.g. the Neighbor-Joining tree (Cuadros et al.,
2007).
The Incremental Board - incBoard (Pinho et al.,
2009) and the Incremental Space (Pinho et al., 2010)
also derive text collection visualizations. They are, by
design, more suited for handling dynamic collections
in which documents are added gradually. These tech-
niques inspired the Incremental Vector Space Model
(iVSM) introduced in Section 4.
Alternatively, vector models may be derived with
topic extraction techniques such as Latent Semantic
Analysis (Landauer et al., 2007) and Latent Dirichlet
Allocation (LDA) (Blei et al., 2003), usually produc-
ing lower-dimensional feature spaces. Topics are also
often extracted to annotate similarity-based visualiza-
tions, based, for instance, on LDA (Wei et al., 2010)
or on association rule mining (Lopes et al., 2007) to
derive topic-oriented views.
Streamit shows real-time views of streaming doc-
uments (Alsakran et al., 2012) built from a dynamic
2D similarity layout computed with a fast imple-
mentation of a force-based projection. Handling
streams poses additional challenges to text visualiza-
tions based on content similarity. In this solution text
documents are described by dynamic keyword vec-
tors, and in computing the cosine similarity a parameter I_k is introduced to account for the importance of
a keyword k at a particular time. Importance may be
determined automatically based on various parame-
ters and it may be modified by users based on their
perception. LDA is employed to reduce feature space
dimensionality. Each topic is associated with a set of
keywords, and documents are represented by a vector
of the probable weights of their topics. Besides reduc-
ing dimensionality, the topics are at a higher semantic
level than terms and likely to produce more meaning-
ful document clusters. However, the topic model is
extracted from an existing similar collection, as the
collection displayed is not available initially.
We are unaware of previous studies on how the
choice of the similarity function affects the outcome
of text visualizations. There are, however, studies
that report comparisons of string distance functions
in other application domains. Cohen et al. (2003)
compare the performance of several distance metrics
for the tasks of matching and clustering lists of entity
names. SecondString is an open-source Java toolkit
that incorporates several string metrics for matching
names and records, including some novel hybrids of
well-known methods. The authors computed three eval-
uation measures, the non-interpolated average preci-
sion, the maximum F1 score and the interpolated pre-
cision at eleven recall levels. In general, the best
results were obtained with the hybrid distances pro-
posed by them.
Kempken et al. (2006) compare the performance
of selected distances to support retrieval of
spelling variants in historical text documents. Ex-
periments were conducted on a dataset of historical
spellings manually collected from historical German
documents, containing a list of word pairs. Distances
were evaluated with the precision and recall mea-
sures, and the best performance was obtained with a
stochastic distance.
3 STRING SIMILARITY
MEASURES
String distance functions map a pair of strings X and
Y to a real number r, where higher values of r indicate
greater dissimilarity between X and Y. String similar-
ity functions, on the other hand, return higher values
for r as X and Y are more similar, and distances may
be generated taking the value 1 − r. In this section we
briefly present string distance and similarity functions
employed in this study.
One important class of string distance functions
are the so-called edit distances, which return the min-
imum number of editing operations required to trans-
form a string into the other. Typical editing oper-
ations are character insertion, deletion and substitu-
tion, and each one is assigned a cost.
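For illustration, a minimal dynamic-programming sketch of the Levenshtein edit distance with unit costs, in Python, could look as follows; the function name and structure are ours for illustration, not part of any toolkit used in this study.

def levenshtein(x, y):
    # prev[j] holds the edit distance between the processed prefix
    # of x and the j-character prefix of y.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1           # substitution cost
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[len(y)]

For instance, levenshtein('kitten', 'sitting') returns 3 (two substitutions and one insertion).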
Two strings X and Y may also be considered as multisets of words (substrings or tokens), and several token-based measures are defined.
IVAPP2013-InternationalConferenceonInformationVisualizationTheoryandApplications
430
Table 1: Token-based measures. Function Q(·) returns the number of tokens in the input string, P(·) returns the number of characters, qG(·) returns the number of substrings of length q, X′ ∩ Y′ denotes the tokens common to X′ and Y′, XY stands for a concatenation of X and Y, and C(·) returns the size, in bytes, of the compressed input string.

Name                   Similarity
Dice's Coefficient     2 Q(X′ ∩ Y′) / (Q(X′) + Q(Y′))                  (1)
Cosine                 Q(X′ ∩ Y′) / √(Q(X′) · Q(Y′))                   (2)
Matching Coefficient   Q(X′ ∩ Y′) / max{Q(X′), Q(Y′)}                  (3)
Overlap Coefficient    P(X′ ∩ Y′) / min{P(X′), P(Y′)}                  (4)
Q-gram                 2 qG(X′ ∩ Y′) / (qG(X′) + qG(Y′))               (5)
NCD                    (C(XY) − min{C(X), C(Y)}) / max{C(X), C(Y)}     (6)
NCDs                   NCD(X, Y) − (NCD(X, X) + NCD(Y, Y)) / 2         (7)
Given two token sets X′ and Y′ derived from X and Y, several similarity functions may be defined, as described in Table 1. In Section 6 we compare these and other distance measures in generating (dis)similarity-based visualizations of text collections.
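To make the formulas in Table 1 concrete, the sketch below implements Dice's Coefficient (Eq. 1), Cosine (Eq. 2) and Q-gram (Eq. 5) over token sets, plus NCD (Eq. 6) using zlib as the compressor. This is our own illustrative Python, with an assumed q = 3 and whitespace tokenization, not the SimMetrics implementation used in the experiments.

import math
import zlib

def tokens(s):
    return set(s.lower().split())          # token multiset treated as a set

def qgrams(s, q=3):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def dice(x, y):                            # Eq. 1
    xs, ys = tokens(x), tokens(y)
    return 2 * len(xs & ys) / (len(xs) + len(ys))

def cosine_sim(x, y):                      # Eq. 2
    xs, ys = tokens(x), tokens(y)
    return len(xs & ys) / math.sqrt(len(xs) * len(ys))

def qgram_sim(x, y, q=3):                  # Eq. 5
    xg, yg = qgrams(x, q), qgrams(y, q)
    return 2 * len(xg & yg) / (len(xg) + len(yg))

def ncd(x, y):                             # Eq. 6
    c = lambda s: len(zlib.compress(s.encode()))
    return (c(x + y) - min(c(x), c(y))) / max(c(x), c(y))

# A similarity s is turned into a distance as 1 - s.
d = 1 - dice("visual text analytics", "visual text mining")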
4 iVSM: A DYNAMIC VECTOR
SPACE MODEL
The Incremental Vector Space Model (iVSM) has been
proposed to represent text documents of an incremen-
tal collection (Pinho et al., 2010). As in the orig-
inal VSM, each dimension represents the tf-idf fre-
quency of a relevant term. As not all documents are
known a priori, an initial representation of the un-
known collection is approximated from the VSM con-
structed for a similar known collection (e.g., news, or
scientific papers). This approximate initial represen-
tation is called a ‘language model’, and provides an
initial set of relevant terms, their frequency (TF) and
the number of documents in which they occur (DF).
The iVSM is constructed by continuously updating
the language model (the TF and DF term counts)
as new documents are added to the collection (or ex-
isting documents are removed).
The process is illustrated with a hypothetical collection with N documents and M terms, for which a VSM has been created, as shown in Table 2, where α_ij stands for the frequency count of term t_j in document d_i. A so-called language model for this collection is defined as shown in Table 3. DF_j is the number of documents that include term t_j, and TF_j is the frequency of term t_j, as computed by Eq. 8.
Table 2: Vector space model (VSM) representation of a collection with N documents. Rows refer to documents and columns to terms that occur in the documents: α_ij denotes the frequency of term t_j in document d_i.

       t_1    t_2    ...   t_M
d_1    α_11   α_12   ...   α_1M
d_2    α_21   α_22   ...   α_2M
...    ...    ...    ...   ...
d_N    α_N1   α_N2   ...   α_NM
Table 3: Language model of the collection: each row represents a VSM term, as shown in Table 2. Column TF informs overall term frequencies and column DF informs how many documents include the corresponding term.

Term   TF     DF
t_1    TF_1   DF_1
t_2    TF_2   DF_2
...    ...    ...
t_M    TF_M   DF_M
TFIDF_j = Σ_{i=1..N} α_ij · log(N / DF_j)    (8)
The language model in Table 3 provides the point
of departure to build the iVSM for a dynamic col-
lection. This is done by incrementally updating the
initial language model whenever a new document ar-
rives. The underlying rationale is very simple: if a
term t present in the incoming document also occurs
in the language model, its corresponding TF and DF values are incremented accordingly (DF only once for each document). Otherwise, the new term is introduced in the language model, and its TF_t and DF_t values are initialized, i.e., DF_t = 1, TF_t = 1.
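A minimal sketch of this update rule, assuming the language model is kept as two term-indexed counters (the names and data structures below are ours for illustration, not the authors' implementation):

from collections import Counter

TF = Counter()  # overall term frequencies in the language model
DF = Counter()  # number of documents containing each term

def add_document(doc_term_counts):
    """Update the language model with one new document,
    given as a mapping from term to its in-document count."""
    for t, count in doc_term_counts.items():
        TF[t] += count   # TF grows by the term's count in this document
        DF[t] += 1       # DF grows by one per document containing t
    # A term unseen so far is created implicitly with TF = count, DF = 1.

Removing a document would decrement the same counters symmetrically.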
If terms are continuously added, the dimensionality of the vector space keeps increasing, which is not desirable. In order to keep dimensionality under control, the set of relevant terms is updated by setting appropriate Luhn cut thresholds, according to Eqs. 9 and 10, where N_0 stands for the maximum value of DF in the current language model. Finally, the iVSM for a particular document is computed considering the tf-idf count of each term t_j currently in the language model, as presented in Eq. 11, where tf_ij stands for the number of occurrences of term t_j in document d_i.
LC_lower = { 3,             if LC_lower < 3
           { llc% of N_0,   if LC_lower > 5% of N_0 or LC_lower < 1% of N_0     (9)

LC_upper = { luc% of N_0,   if LC_upper < 90% of N_0 or LC_upper > N_0          (10)
AStudyontheRoleofSimilarityMeasuresinVisualTextAnalytics
431
iVSM_ij = { tf_ij · log(N_0 / DF_j),   if LC_lower ≤ DF_j ≤ LC_upper
          { 0,                         otherwise                                 (11)
with llc and luc standing for the chosen lower
and upper cut Luhn’s thresholds, respectively. In the
studies reported in Section 6 these were set to 2%
and 95%, respectively. When applying this model to
streaming text, similarity measures may be updated as
needed by the underlying layout technique. Its usage
in tandem with incremental algorithms, e.g., incBoard
and incSpace, was envisioned to require only partial
recalculation of similarity measures as the collection
changes over time, as required by those algorithms.
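Our reading of Eqs. 9-11 could be sketched as follows, with llc = 2% and luc = 95% as in Section 6; this is an illustrative Python interpretation, not the authors' code.

import math

def ivsm_vector(doc_term_counts, DF, llc=0.02, luc=0.95):
    n0 = max(DF.values())            # N_0: maximum DF in the language model
    lc_lower = max(3, llc * n0)      # lower Luhn cut (Eq. 9)
    lc_upper = luc * n0              # upper Luhn cut (Eq. 10)
    vec = {}
    for t, tf in doc_term_counts.items():
        df = DF.get(t, 0)
        # Eq. 11: weight only terms whose DF lies between the cuts;
        # all other coordinates are zero (omitted from the sparse dict).
        if lc_lower <= df <= lc_upper:
            vec[t] = tf * math.log(n0 / df)
    return vec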
5 STUDY SET-UP
Our goal is to investigate how the choice of repre-
sentation model and dissimilarity function affect the
quality of layouts output by point-placement tech-
niques applied to textual collections. Assessing qual-
ity of point-placement layouts is a difficult issue, as
analysis depends on the tasks the layout is meant to
support. We believe important tasks are related to
the layout’s capability of preserving meaningful text
clusters, i.e., to what extent it favors data grouping
and group segregation; alternatively analysts may de-
sire layouts capable of preserving as much as possible
the original distances, or dissimilarity relations.
Some objective quality measures may be applied
to compare different layouts in this context. We con-
sider the Silhouette Coefficient (Tan et al., 2005), that
attempts to quantify the quality of clusters identifiable
in the feature space or in a layout derived from it, and
the Neighborhood Hit curve (Paulovich et al., 2008),
which attempts to quantify to what extent a layout
preserves known classes.
The silhouette coefficient SC of a cluster is computed as the average of the silhouette coefficients of its individual points. The silhouette of a particular data point p_i, belonging to a cluster C_i, is computed according to Equation (12):

SC_{p_i} = (b_i − a_i) / max(a_i, b_i)    (12)

where a_i is the average distance from p_i to all the other data points in C_i and b_i is the minimum average distance from p_i to the other clusters, obtained after computing the average distance from p_i to all the data points in a cluster C_j, for all j ≠ i. SC takes values in the range [−1, 1]. Negative values indicate that a_i > b_i, whereas the opposite is desirable. Notice that SC assumes its maximum value when a_i = 0.
The Neighborhood Hit (NH) is a curve that con-
veys the layout’s capability of preserving class struc-
ture. The NH value for an individual data point is
computed by counting the number of its neighbors in the
projected layout that belong to its same label or class.
The curve is obtained by averaging the NH measure
computed for all individual data points, for a varying
number of neighbors, from 1 to a maximum.
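A sketch of the NH computation, given 2D projected coordinates and class labels (our own illustration, assuming Euclidean distances in the layout):

import numpy as np

def neighborhood_hit(points, labels, max_k=30):
    """Return the NH value for k = 1..max_k, averaged over all points."""
    points, labels = np.asarray(points), np.asarray(labels)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbor
    order = np.argsort(d, axis=1)            # neighbors sorted by distance
    hits = labels[order] == labels[:, None]  # True where a neighbor shares the class
    return [float(hits[:, :k].mean()) for k in range(1, max_k + 1)]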
We compared layouts obtained with two rep-
resentative point-placement techniques. The Least
Square Projection (LSP) (Paulovich et al., 2008) is
a multidimensional projection technique, whereas the
Neighbor-Joining Tree (Cuadros et al., 2007) gener-
ates a hierarchy from a given dissimilarity matrix.
LSP attempts to generate a layout that preserves
neighborhood groupings in the feature space. It first
obtains a subsample of the data points, called control points, that is hopefully representative of the overall spatial distribution, and then computes neighborhoods for these sampled points. The control points are
projected first with a precise technique, and their pro-
jected coordinates, plus the neighborhoods, provide
information to build a linear system model that is
solved to obtain the projected coordinates of all data
points. LSP takes as input parameters a pairwise dis-
tance matrix computed for the collection, the number
of control points, and the number of neighbors to con-
sider in defining neighborhoods.
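In rough outline, LSP can be read as a sparse least-squares problem: one row per point ties it to the average of its neighbors' coordinates, and extra rows anchor the control points at their precomputed positions. The sketch below is our simplified dense rendering of that idea, not the authors' implementation (which relies on sparse solvers).

import numpy as np

def lsp_layout(neighbors, control_idx, control_xy, n):
    """neighbors[i]: indices of the neighbors of point i;
    control_idx, control_xy: control point indices and their 2D coords."""
    rows = []
    for i in range(n):                    # x_i - mean of neighbors = 0
        r = np.zeros(n)
        r[i] = 1.0
        for j in neighbors[i]:
            r[j] = -1.0 / len(neighbors[i])
        rows.append(r)
    for i in control_idx:                 # anchor rows for control points
        r = np.zeros(n)
        r[i] = 1.0
        rows.append(r)
    A = np.vstack(rows)
    b = np.zeros((A.shape[0], 2))
    b[n:] = control_xy                    # right-hand side: anchors only
    xy, *_ = np.linalg.lstsq(A, b, rcond=None)
    return xy                             # projected 2D coordinates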
The Neighbor-Joining (NJ) tree is inspired by algorithms for building phylogenetic trees in Biology. It builds a tree that describes ancestry relations be-
tween species, given a matrix of pairwise distances
between them. Then, a tree layout algorithm is em-
ployed to display the resulting hierarchy. NJ takes
as input a pairwise distance matrix of the collection
and requires no additional parameters. Whereas LSP
shows a global view that attempts to convey mean-
ingful groups of texts that have similar content, the
branches and sub-branches in the tree view allow a
user to infer levels or degrees of similarity between
the texts.
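For reference, the classical NJ joining step can be sketched as follows (our own simplified Python; it records the join order and ignores branch lengths, which the full algorithm also computes):

import numpy as np

def neighbor_joining(D):
    """Return successive joins (node_a, node_b, new_node) from a
    symmetric pairwise distance matrix D."""
    D = np.array(D, dtype=float)
    active = list(range(len(D)))
    next_id = len(D)
    joins = []
    while len(active) > 2:
        n = len(active)
        sub = D[np.ix_(active, active)]
        r = sub.sum(axis=1)
        Q = (n - 2) * sub - r[:, None] - r[None, :]   # NJ Q-criterion
        np.fill_diagonal(Q, np.inf)
        a, b = np.unravel_index(np.argmin(Q), Q.shape)
        i, j = active[a], active[b]
        du = 0.5 * (D[i] + D[j] - D[i, j])    # distances to the new node
        D = np.pad(D, ((0, 1), (0, 1)))
        D[next_id, :next_id], D[:next_id, next_id] = du, du
        joins.append((i, j, next_id))
        active.remove(i); active.remove(j); active.append(next_id)
        next_id += 1
    joins.append((active[0], active[1], None))
    return joins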
Studies were conducted on textual datasets¹,² of scientific papers and news articles, summarized in Table 4.
We computed 15 distinct pairwise dissimilarity matrices for the datasets, using the following string distance or similarity functions³: Block, Jaccard, Cosine, Euclidean, Jaro-Winkler, Dice Coefficient,
¹ http://infoserver.lcad.icmc.usp.br/infovis2/DataSets
² http://www.daviddlewis.com/resources/testcollections/reuters21578
³ http://sourceforge.net/projects/simmetrics
IVAPP2013-InternationalConferenceonInformationVisualizationTheoryandApplications
432
Table 4: Text datasets.

Name          Description                            General content     # docs   # classes
CBR-ILP-IR    case-based reasoning, inductive        scientific papers   574      3
              logic programming and information
              retrieval
news2011      RSS news feeds (AP, CNN,               news                1,771    23
              Reuters and BBC)
ReutersNews   subset from the Reuters21578           news                3,988    7
              collection (newswire articles)
Levenshtein, Matching Coefficient, Smith-Waterman,
Jaro, QGram, Soundex, Needleman-Wunsch, Monge-Elkan
and Overlap Coefficient. Their choice was based on a
survey of existing alternatives for string comparison.
After inputting the distance matrices to LSP (con-
sidering two distinct configurations for the number
of control points and neighborhood size) and to the
NJ-tree, resulting layouts were compared to identify
the functions with the best results on the CBR-ILP-IR
data, by conducting a subjective evaluation of their
visual quality and also comparing their correspond-
ing NH curves. This preliminary analysis identified
five best-performing string measures for further investigation, namely Cosine Similarity, Dice's Coefficient, Matching Coefficient, Overlap Coefficient and QGram.
In all cases some text preprocessing was applied, varying across test cases according to the nature and goals of the different functions. Luhn's cut thresholds, stopword removal and Porter stemming were employed when appropriate, as detailed in the Results section.
In a subsequent step, we compared the previous
five string measures, plus Normalized Compression
Distance (NCDs) (Telles et al., 2007), with the con-
ventional approach for generating similarity-based
layouts from text, namely the Cosine similarity ap-
plied over a VSM vector representation. Finally, we
included in the comparison the Cosine similarity ap-
plied over the iVSM model introduced in Section 4.
Precision results are shown in Section 6; processing times are given in Table 5.
Table 5: Processing times (in seconds) for computing dis-
similarity matrices with the distinct string dissimilarity
functions.
Measure CBR-ILP-IR News2011 ReutersNews
Cosine Distance 750 41 2,331
Dice’s coefficient 715 41 2,344
Matching’s coefficient 1,588 73 4,761
Overlap’s coefficient 758 41 2,319
Qgram Distance 16,744 1,215 52,877
NCDs 1,350 10,038 63,109
6 RESULTS
Figure 1 shows the layouts obtained with LSP and
with NJ using as input dissimilarity matrices com-
puted employing the cosine distance over the VSM
and iVSM representations, respectively, for the three
datasets considered. The LSP input parameters were
set to 57, 177 and 398 control points, respectively, for
CBR-ILP-IR, News2011 and ReutersNews, and to 15
nearest-neighbors in all cases. Figure 2 shows the cor-
responding NJ tree layouts, created with the NJ imple-
mentation by Paiva et al. (2011), which is faster than the original one (Cuadros et al., 2007)⁴. In the visual-
izations each circle represents a document and color
maps the document class. One may visually assess
the degree of class separation inspecting the spatial
distribution of colors in the LSP layouts, or the distri-
bution of colors in the branches and sub-branches of
the NJ-tree layouts.
Figure 1: LSP layouts generated for text datasets: CBR-ILP-IR (top: (a) VSM, (b) iVSM), News2011 (middle: (c) VSM, (d) iVSM) and NewsReuters (bottom: (e) VSM, (f) iVSM), using the cosine distance. Circle color maps document class.
In order to generate the visualizations, textual data
was preprocessed with stopwords removal, Porter’s
stemming and definition of Luhn’s thresholds. We
removed the usual stopwords, such as articles and
⁴ http://infoserver.lcad.icmc.usp.br/infovis2/NeighborJoiningTree
AStudyontheRoleofSimilarityMeasuresinVisualTextAnalytics
433
prepositions, and also a few domain specific words
when handling scientific papers or news, e.g., for pa-
pers added stopwords included ‘press’, ‘proceedings’,
‘proc’, ‘vol’ and ‘year’. In generating the VSM mod-
els we set Luhn’s lower cut to 10, and applied no up-
per cut threshold. In generating the iVSM models,
Luhn’s thresholds were defined according to Equa-
tions 9 and 10. For the CBR-ILP-IR data the start-
ing language model was generated from an available
data set of 2,814 scientific papers (All.zip) from mul-
tiple conferences and journals on Visualization, avail-
able at the same site as the CBR-ILP-IR data set. For
News2011 and Reuters the starting language model
has been computed from an existing collection with
news from April 2006 (AP BBC CNN Reuters.zip),
again available at the same site.
Figure 2: NJ-tree layouts for text datasets: CBR-ILP-IR (top: (a) VSM, (b) iVSM), News2011 (middle: (c) VSM, (d) iVSM) and NewsReuters (bottom: (e) VSM, (f) iVSM), using the cosine distance. Circle color maps document class.
Figures 3, 4 and 5 show the neighborhood preser-
vation curves of the layouts depicted in the previous
figures, for each dataset. One observes that for CBR-
ILP-IR the iVSM model does a considerably better
job as far as neighborhood preservation is concerned,
both for LSP and NJ layouts. This is not true for the
news collections: in News2011 LSP with VSM does
better, whereas both VSM and iVSM curves relative
to the NJ layouts are very similar, although iVSM
does slightly better. For NewsReuters NJ with iVSM
does better, whereas LSP with iVSM performs better
up to 7 neighbors, then VSM starts doing better.
Figure 3: NH graphs of (a) LSP and (b) NJ layouts of CBR-ILP-IR built with the VSM and iVSM models and cosine similarity.
Figure 4: NH graphs of (a) LSP and (b) NJ layouts of News2011 built with the VSM and iVSM models and cosine similarity.
We also compared the neighborhood preservation
capability of layouts obtained using distance matri-
ces computed with distinct string similarity measures,
plus the cosine similarity computed over the VSM and
iVSM models, for the three datasets.
Results are shown in Figure 6 for the CBR-ILP-
IR data. We considered two configurations of LSP,
IVAPP2013-InternationalConferenceonInformationVisualizationTheoryandApplications
434
Figure 5: NH graphs of (a) LSP and (b) NJ layouts of NewsReuters built with the VSM and iVSM models and cosine similarity.
with 57 and 177 control points, both with 15 nearest-
neighbors. The text preprocessing applied varied de-
pending on the dissimilarity measure employed. In
generating the VSM and iVSM models we applied
general and domain specific stopword removal and
no stemming. For VSM a lower Luhn’s cut was set
to 10 and no upper cut was adopted; for iVSM the
thresholds were computed automatically as defined
in Equations 9 and 10, and the language model has
been computed from the same All.zip dataset. For the
string distance matrices, pre-processing procedures
also varied. General and specific stopwords were re-
moved from the input strings when using the string-
based Cosine distance, as well as Dice’s Coefficient,
Overlap Coefficient and Qgram. No stopword re-
moval was applied when using the Matching Coeffi-
cient and the NCDs measures. The choice of applying
(or not) stopword removal was made after verifying
which alternative produced the best NH curves.
In the first LSP configuration, shown in Fig-
ure 6(a), the best results regarding class segregation capability were obtained with the cosine distance over the iVSM model (referred to in the figures as iVSM cosine) and with the string-based Dice's Coefficient. The string-based Cosine also did well; these three curves remain above 0.9. Despite their inferior performance compared to the previous ones, all the other distance measures produced curves with values above 0.8. The curves of the second LSP configuration (Figure 6(b)) show that the best results were achieved with the string-based Overlap Coefficient and with iVSM cosine and VSM cosine, all curves again remaining roughly above the 0.9 threshold.
Figure 6: NH graphs of (a) LSP (cp: 57, nn: 15), (b) LSP (cp: 177, nn: 15) and (c) NJ layouts of CBR-ILP-IR obtained with 8 distinct distance matrices: 6 string functions plus the VSM and iVSM with cosine dissimilarity.
The worst results were given by string-
based Matching Coefficient and Qgram. For the NJ
layouts results are quite different: the best perform-
ing measures are string-based, namely Overlap Co-
efficient, Qgram, NCDs and Cosine. VSM cosine
and string-based Matching Coefficient displayed the
worst performances. iVSM cosine did considerably
better than VSM cosine, and although not top ranked
it comes close to the top ranked ones.
For the News2011 collection we employed LSP
with 177 control points and 15 nearest-neighbors,
and with 150 control points and 20 nearest-neighbors.
The resulting NH graphs for the LSP (two configurations)
and NJ layouts are shown in Figure 7. Preprocess-
ing steps were the same as for CBR-ILP-IR, and the
language model for iVSM has been computed from
the AP BBC CNN Reuters.zip dataset. As for the
string distances, general and specific stopwords re-
moval was employed for Dice’s Coefficient, Matching
Coefficient, Overlap Coefficient and Qgram. No stop-
word removal was applied to the string-based Cosine
and the NCDs distance.
For the first LSP configuration (Figure 7(a)) the best
results were obtained with the cosine distance over the
AStudyontheRoleofSimilarityMeasuresinVisualTextAnalytics
435
Figure 7: NH graphs of (a) LSP (cp: 177, nn: 15), (b) LSP (cp: 150, nn: 20) and (c) NJ layouts of News2011 obtained with 8 distinct distance matrices: 6 string functions plus the VSM and iVSM with cosine dissimilarity.
iVSM and VSM models and string-based Qgram,
which all show curves with values above 0.73. The
string-based Dice’s Coefficient, Matching Coefficient
and NCDs resulted in the worst performances (curves
staying below 0.6). In the second LSP configuration,
shown in Figure 7(b), one notices that iVSM cosine,
VSM cosine and Qgram kept the best performances.
Note that in this configuration the NH curves slightly outperform the ones in Figure 7(a). The worst results
were returned by string-based Cosine (identified in
the figures as cosine S) and NCDs. Moreover, all NH
curves produced by NJ (Figure 7(c)) achieve similar
precision values, above 0.75. Nonetheless, the best
results are again by iVSM cosine, VSM cosine and
Qgram.
For the NewsReuters collection we employed LSP
with 398 control points and 15 nearest-neighbors, and
with 200 control points and 20 nearest-neighbors.
The resulting LSP and NJ NH curves are in Figure
8. Pre-processing to generate the VSM and iVSM
models was applied as described for News2011. As
for the string distances, general and specific stopword
removal was employed for Cosine Distance, Overlap
Coefficient, Qgram and NCDs measures. No stop-
word removal was applied to the string-based Dice’s
Coefficient and Matching Coefficient distance.
Figure 8: NH graphs of (a) LSP (cp: 398, nn: 15), (b) LSP (cp: 200, nn: 20) and (c) NJ layouts of NewsReuters obtained with 8 distinct distance matrices: 6 string functions plus the VSM and iVSM with cosine dissimilarity.
Figures 8(a) and 8(b) show the results for the two
LSP configurations. In both cases the iVSM produced
the highest precision values, followed by the VSM
and the string-based Cosine; the NH curves of the latter two are at their best in the second configuration (curves stay above 0.85). The worst results were given by
string-based Matching Coefficient in the first config-
uration (Figure 8(a)) and by NCDs in the second (Fig-
ure 8(b)). The best NH curves for the NJ layouts were
obtained with string-based Cosine, Dice’s Coefficient
and the iVSM. String-based NCDs and Qgram dis-
played the worst performances. Despite their inferior
performance, these distance measures still produced
curves with values above 0.87.
Figure 9 shows the Silhouette Coefficients (SC)
computed for the datasets considering different dis-
tance functions, in the original (blue bars) and in the
NJ-tree visual space (red bars). Distances in the NJ-
tree are computed considering path lengths. As dis-
cussed in Section 5, SC values closer to 1.0 indicate
highly cohesive and well separated clusters, accord-
IVAPP2013-InternationalConferenceonInformationVisualizationTheoryandApplications
436
Table 6: Ranking of NH curves of layouts obtained with string-based metrics and with the cosine similarity computed over VSM and iVSM on the three datasets.

          CBR-ILP-IR                              News2011                                   NewsReuters
Ranking   LSP (1)       LSP (2)       NJ          LSP (1)       LSP (2)       NJ             LSP (1)       LSP (2)       NJ
1         iVSM          Dice's C      Overlap's C iVSM          iVSM          iVSM           iVSM          iVSM          Cosine
2         Dice's C      Cosine        Qgram       VSM           VSM           VSM            VSM           VSM           Dice's C
3         Cosine        Overlap's C   NCDs        Qgram         Qgram         Qgram          Cosine        Cosine        iVSM
4         Qgram         iVSM          Cosine      Overlap's C   Overlap's C   Dice's C       Dice's C      Dice's C      Overlap's C
5         VSM           VSM           iVSM        Cosine        Matching's C  Cosine         Overlap's C   Overlap's C   VSM
6         Overlap's C   NCDs          Dice's C    NCDs          Dice's C      Overlap's C    Qgram         Qgram         Matching's C
7         Matching's C  Matching's C  VSM         Matching's C  Cosine        Matching's C   NCDs          Matching's C  Qgram
8         NCDs          Qgram         Matching's C Dice's C     NCDs          NCDs           Matching's C  NCDs          NCDs
Figure 9: Silhouette Coefficients of the datasets for the different distance functions, in the original feature space (blue bars) and in the NJ-tree visual space (red bars): (a) CBR-ILP-IR, (b) News2011, (c) NewsReuters.
ing to the distance function considered. One observes
how the choice of the distance function affects the
grouping of elements based on similarity, in both the
original and the visual data spaces.
Ideally, a similarity-based layout should not de-
grade cluster quality, or, even better, it could actu-
ally improve it, favoring user perception of possibly
meaningful structures. Indeed, the figures show that
the NJ layout does improve cluster quality relative to
the feature space in some cases, in terms of cohesive-
ness and separation, as measured by the SC. Inspect-
ing the bar charts one notices that cluster quality in the
feature space may be poor, and some distance func-
tions are more effective than others in identifying bet-
ter quality clusters.
For the CBR-ILP-IR data, we notice that all dis-
tance functions actually contributed to a projected
layout with improved cluster quality. In fact, all
distances produced very low SC values in the fea-
ture space, always below 0.1 with the exception
of iVSM cosine. The SC value in the projected space
is better for all functions, with the Overlap Coeffi-
cient distance doing the best job in this matter. In
the News2011 data, again SC values in the feature
space are low and improve in the projected layouts,
with the exception of the layout obtained with NCDs.
The picture is quite different in the NewsReuters data,
however: most distances produce worse SC values in
the projected space, with the exception of the string-
based Cosine, Dice’s Coefficient and Overlap Coef-
ficient. VSM cosine and NCDs roughly preserve the
cluster quality as in the original space. Unlike the
other cases iVSM cosine performed poorly in this
data.
It is worth noting that we did not consider the
Silhouette Coefficient on the LSP projection because
distance computation in 2D space tends to favor
round-shaped clusters, and as such it is not necessarily
a meaningful measure of cluster quality in the visual
space when cluster shapes vary widely.
7 CONCLUSIONS
In our experiments we observed that VSM and iVSM
generated visualizations with the best class segrega-
tion capability. Similarity-based layouts of text col-
lections obtained using both models were compared
using Neighborhood Hit curves, for which values
close to 1.0 reflect layouts with good class preser-
vation capability. A global ranking summarizing the
major findings is presented in Table 6. The iVSM out-
performed the VSM, or otherwise stayed close to it, in
most cases. Given the observed results, we propose
iVSM as a new incremental model based on VSM.
AStudyontheRoleofSimilarityMeasuresinVisualTextAnalytics
437
Coupled with incremental MDS techniques, e.g., in-
cBoard and incSpace, it is well-suited for handling
text streams and time-stamped document collections,
with limited recalculations.
Some string-based metrics also performed well in
the comparisons, in particular Qgram, the string-based Cosine and the Overlap Coefficient. Their major advantage is not requiring intermediate text representations such as the vector models, although distance calculations are computationally expensive. A next step
is to evaluate iVSM and string measures in a truly in-
cremental setup, by applying them in displaying text
streams with, e.g., incBoard or incSpace.
The approaches considered disregard any kind of
semantic analysis of text. For instance, stemming in
preprocessing impacts semantics in a not very pre-
dictable manner. Although this type of processing
and dissimilarity calculation suffices for many appli-
cations, further investigation should be conducted on
semantic-based distances, as semantics cannot be ig-
nored in some text analytics applications. The impact
of the language model also needs further study.
ACKNOWLEDGEMENTS
The authors acknowledge the support of FAPESP and
CNPq. Ideas and opinions expressed are those of the
authors and do not necessarily reflect those of their
employers or host organizations.
REFERENCES
Alsakran, J., Chen, Y., Luo, D., Zhao, Y., Yang, J., Dou,
W., and Liu, S. (2012). Real-Time Visualization of
Streaming Text with a Force-Based Dynamic System.
IEEE Comp. Graph. and Applic., 32(1):34–45.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
dirichlet allocation. J. of Mach. Learn. Res., 3:993–
1022.
Cohen, W., Ravikumar, P., and Fienberg, S. (2003). A
Comparison of String Distance Metrics for Name-
Matching Tasks. In Proc. IJCAI-2003 Workshop on
Information Integration on the Web, pages 73–78.
Cuadros, A. M., Paulovich, F. V., Minghim, R., and Telles,
G. P. (2007). Point Placement by Phylogenetic Trees
and its Application to Visual Analysis of Document
Collections. In Proc. 2007 IEEE Symp. Vis. Analytics
Sci. and Techn., pages 99–106.
Huang, S., Ward, M., and Rundensteiner, E. (2005). Ex-
ploration of Dimensionality Reduction for Text Vi-
sualization. In Proc. Coord. and Mult. Views in Ex-
ploratory Vis., pages 63–74.
Kempken, S., Luther, W., and Pilz, T. (2006). Comparison
of distance measures for historical spelling variants.
In Artif. Intel. Theory and Prac., pages 295–304.
Landauer, T. K., McNamara, D. S., Dennis, S., and Kintsch,
W. (2007). Handbook of Latent Semantic Analysis.
Lawrence Erlbaum Assoc.
Lopes, A. A., Pinho, R., Paulovich, F. V., and Minghim,
R. (2007). Visual text mining using association rules.
Comp & Graph., 31(3):316–326.
Paiva, J. G. S., Florian, L., Pedrini, H., Telles, G. P., and
Minghim, R. (2011). Improved Similarity Trees and
their Application to Visual Data Classification. IEEE
Trans. on Vis. and Comp. Graph., 17(12):2459–2468.
Paulovich, F. V. and Minghim, R. (2008). HiPP: A Novel
Hierarchical Point Placement Strategy and its Appli-
cation to the Exploration of Document Collections.
IEEE Tran. Vis. and Comp. Graph., 14(6):1229–1236.
Paulovich, F. V., Nonato, L. G., Minghim, R., and Lev-
kowitz, H. (2008). Least Square Projection: A Fast
High-Precision Multidimensional Projection Tech-
nique and its Application to Document Mapping.
IEEE Trans. Vis. and Comp. Graph., 14(3):564–575.
Pinho, R., de Oliveira, M. C. F., and Lopes, A. A. (2009).
Incremental board: a grid-based space for visualizing
dynamic data sets. In Proc. 2009 ACM Symp. Appl.
Comp., pages 1757–1764.
Pinho, R., de Oliveira, M. C. F., and Lopes, A. A. (2010).
An incremental space to visualize dynamic data sets.
Multimedia Tools and Appl., 50(3):533–562.
Salton, G., Wong, A., and Yang, C. S. (1975). A Vec-
tor Space Model for Automatic Indexing. Commun.
ACM, 18(11):613–620.
Tan, P. N., Steinbach, M., and Kumar, V. (2005). Introduc-
tion to Data Mining. Addison-Wesley.
Telles, G. P., Minghim, R., and Paulovich, F. V. (2007). Nor-
malized compression distance for visual analysis of
document collections. Comp. & Graph., 31(3):327–
337.
Wei, F., Liu, S., Song, Y., Pan, S., Zhou, M. X., Qian, W.,
Shi, L., Tan, L., and Zhang, Q. (2010). TIARA: A
Visual Exploratory Text Analytic System. In Proc.
16th ACM SIGKDD Int. Conf. on Knowl. Discovery
and Data Min., pages 153–162.
Wise, J. A., Thomas, J. J., Pennock, K., Lantrip, D., Pottier,
M., Schur, A., and Crow, V. (1995). Visualizing the
non-visual: spatial analysis and interaction with in-
formation from text documents. In Proc. 1995 IEEE
Symp. Inf. Vis., pages 51–58.
IVAPP2013-InternationalConferenceonInformationVisualizationTheoryandApplications
438