Coupled with incremental MDS techniques, e.g., incBoard
and incSpace, it is well suited to handling
text streams and time-stamped document collections
with limited recalculation.
Some string-based metrics also performed well in
the comparisons, in particular Qgram, string-based
Cosine and Overlapping Coefficient. Their major
advantage is that they require no intermediate text
representation such as a vector model, although their
distance calculations are computationally expensive.
A next step is to evaluate iVSM and the string measures
in a truly incremental setup, by using them to display
text streams with, e.g., incBoard or incSpace.
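To make concrete what computing a dissimilarity directly on strings involves, the following is a minimal sketch of the three measures named above. It is an illustration only, not the implementation used in the comparisons: the function names, the choice of q = 3, the '#' padding, and the use of character q-grams as the unit for all three measures are assumptions (other implementations may, for instance, tokenize on whitespace for Cosine and Overlap).

```python
from collections import Counter
from math import sqrt


def qgrams(text, q=3):
    """Multiset of character q-grams; '#' padding is an assumed convention."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_distance(a, b, q=3):
    """Sum of absolute differences between the two q-gram count profiles."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return sum(abs(ga[g] - gb[g]) for g in ga.keys() | gb.keys())


def cosine_dissimilarity(a, b, q=3):
    """1 minus the cosine of the angle between the q-gram count profiles."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    dot = sum(ga[g] * gb[g] for g in ga.keys() & gb.keys())
    norm = sqrt(sum(v * v for v in ga.values())) * sqrt(sum(v * v for v in gb.values()))
    return 1.0 - dot / norm if norm else 0.0


def overlap_dissimilarity(a, b, q=3):
    """1 minus the overlap coefficient |A & B| / min(|A|, |B|) on q-gram sets."""
    sa, sb = set(qgrams(a, q)), set(qgrams(b, q))
    smaller = min(len(sa), len(sb))
    return 1.0 - len(sa & sb) / smaller if smaller else 0.0


if __name__ == "__main__":
    d1 = "incremental text visualization"
    d2 = "incremental visualization of text streams"
    print(qgram_distance(d1, d2))
    print(cosine_dissimilarity(d1, d2))
    print(overlap_dissimilarity(d1, d2))
```

No vector model or vocabulary is built or maintained across documents, which is the advantage noted above. The cost is that every pairwise comparison rebuilds and compares q-gram profiles whose size grows with document length.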
The approaches considered disregard any kind of
semantic analysis of text. For instance, stemming during
preprocessing affects semantics in ways that are hard
to predict. Although this type of processing
and dissimilarity calculation suffices for many appli-
cations, further investigation should be conducted on
semantic-based distances, as semantics cannot be ig-
nored in some text analytics applications. The impact
of the language model also needs further study.
The authors acknowledge the support of FAPESP and
CNPq. Ideas and opinions expressed are those of the
authors and do not necessarily reflect those of their
employers or host organizations.
