clusters (middle), while the cluster on the right is
more homogeneous (right). Another issue with the
cell size parameter is that the clusters do not
necessarily change smoothly when the cell size is
varied smoothly. To alleviate this issue, we introduced
an animation, as shown in Figure 5.
5 EVALUATION
To evaluate the effectiveness of our tool, we
performed a user study with 10 subjects of
different professional backgrounds, genders, and ages.
We gave a short tutorial and subsequently asked 8
easy questions intended to familiarize the subjects
with the functionality of the tool. We asked about
the number of dimensions, number of samples, the
dimension with the broadest range, the dimension
contributing most to the variance, the number of
outliers in one dimension, the number of visible
structures in the PCA view, a comparison between
clusters in terms of size, area, and density, and the
correctness of an aggregated view. Afterwards, we
asked them to perform actual analysis tasks such as
identifying the correct number of clusters, testing
clusters on homogeneity, and finding the most
similar observations to an outlier. All tasks were
conducted on the Fake Clover dataset. The outcome
was evaluated by computing the correctness of the
answers. Time was not part of the investigation, but
the study took on average 66 minutes (ranging
between 29 and 98 minutes) per participant.
In the user study, subjects completed the tasks
with a high average correctness rate of 90.0%
(92.5% for the easy questions and 83.3% for the
actual analysis tasks). We observed no difference
in performance between groups of different
professional backgrounds.
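As a plausibility check, the reported overall rate can be recomputed from the two per-group rates. The following is a minimal sketch, not part of the study itself. It assumes that every answer is weighted equally and that there were exactly three analysis tasks (the three named above); the function name weighted_rate is hypothetical.

```python
from fractions import Fraction

def weighted_rate(groups):
    """Combine per-group correctness rates, weighting each group by its number of answers.

    `groups` is a list of (rate, number_of_answers) pairs.
    """
    total_answers = sum(n for _, n in groups)
    total_correct = sum(rate * n for rate, n in groups)
    return total_correct / total_answers

subjects = 10
easy_questions = 8   # stated in the text
analysis_tasks = 3   # ASSUMPTION: the three tasks named in the text

groups = [
    (Fraction(925, 1000), subjects * easy_questions),   # 92.5% on easy questions
    (Fraction(833, 1000), subjects * analysis_tasks),   # 83.3% (rounded) on analysis tasks
]

overall_percent = round(float(weighted_rate(groups)) * 100, 1)
print(overall_percent)
```

Under these assumptions the result rounds to 90.0, which agrees with the reported average correctness rate.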
6 CONCLUSIONS
We presented an interactive visual tool for
effectively analysing unlabeled multi-dimensional
data using data aggregation and distance encoding.
Data aggregation is based on K-means clustering
and a cell-based density clustering. The cell size
parameter controls the granularity of the data
aggregation. Cluster properties are visually encoded
in aggregated form using color and size or in
detailed form using circular parallel plots and
scatterplots in a local layout. Distances are
computed efficiently and conveyed by
displaying k-nearest neighborhoods with edges, which
allows for analysing the neighbourhood preservation
property of the chosen projection. Projections are
based on PCA, but a dimension-scaling widget
allows for interactive weighting of axes and a star-
coordinate widget allows for changing the projection
matrix. We have shown that our tool can be
effectively applied to analyze multi-dimensional
data.
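To illustrate what neighbourhood preservation of a projection means, the following minimal sketch compares the k-nearest neighbourhoods of each point before and after projection. It is not the implementation used in our tool: the brute-force search, the overlap score, the function names, and the toy data are all assumptions made for illustration, and the stand-in projection simply drops a coordinate rather than applying PCA.

```python
import math

def knn(points, k):
    """For each point, return the set of indices of its k nearest neighbours (Euclidean)."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort by distance; ties are broken by index to keep the result deterministic.
        others.sort(key=lambda j: (math.dist(p, points[j]), j))
        result.append(set(others[:k]))
    return result

def neighbourhood_preservation(high, low, k):
    """Mean fraction of each point's k nearest neighbours in `high`
    that are also among its k nearest neighbours in `low`."""
    nn_high = knn(high, k)
    nn_low = knn(low, k)
    shared = sum(len(a & b) for a, b in zip(nn_high, nn_low))
    return shared / (k * len(high))

# Hypothetical toy data: two well-separated groups in 3-D.
data = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.2), (0.0, 0.1, 0.1),
        (5.0, 5.0, 0.0), (5.1, 5.0, 0.3), (5.0, 5.1, 0.1)]

# Stand-in projection (ASSUMPTION, not PCA): keep the first two coordinates.
projected = [(x, y) for x, y, _ in data]

score = neighbourhood_preservation(data, projected, k=2)
print(score)
```

A score of 1 means that every k-nearest neighbourhood survives the projection unchanged, while lower values indicate that the projection places points next to each other that are not neighbours in the original space. Projecting the same toy data onto the third coordinate alone, for example, mixes the two groups and lowers the score.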
IVAPP 2018 - International Conference on Information Visualization Theory and Applications