ceptually viable 3D space. Figures 5 and 6 further il-
lustrate stream renditions of the period and amplitude
feature groups, respectively. Each frame of reference
contains 452 patient samples viewed from a prescribed
spatial angle and displayed in a three-dimensional
space. A closer look reveals distinct structure that is
vital for arrhythmia clustering, but there are also clear
indications of instances that overlay each other although
they might belong to different arrhythmia classes;
furthermore, some samples deviate far from the formed
partitions. Hence our contention that a clustering method,
to operate effectively on arrhythmia streams captured from
a large number of patients, must allow clusters to take any
geometrical shape of variable size and must properly
address data point overlaps. In addition, the algorithm
is required to be robust in the presence of outliers. We
selected the hierarchical Clustering Using REpresentatives
(CURE) method (Guha et al., 1998), which was explicitly
designed to support these prerequisites.
Cluster analysis of ECG recordings is a powerful
tool for discovering patients with similar arrhythmia
disorders. In contrast to supervised methods, our
evaluation proceeds anonymously on cardiac data presumed
to be unlabeled, without resorting to prior knowledge of
a cardiologist assessment. Several clustering methods
have been devised in the domain of data mining,
each with its own strengths and shortcomings. To keep
the description concise, we refer the reader to a survey
of clustering methods applied exclusively to time series
data (Liao, 2005). CURE stands out as a highly efficient
hierarchical clustering algorithm (Johnson, 1967) that has
linear storage requirements, $O(n)$, and a time complexity
of $O(n^2)$ for low-dimensional data of n points, each of
dimensionality d, and is no worse than the more
constrained, centroid-based hierarchical method. CURE
is agglomerative: it starts by placing each individual
data point in a cluster of its own and successively
merges the closest pair of clusters until the number of
clusters reduces to k. Each cluster holds a set of c
representative points, chosen to be well scattered across
the cluster extent and then shrunk towards the
cluster centroid by a fractional factor α. The
well-scattered representative points and the contraction
operation that follows serve the objectives of capturing
a cluster of arbitrary geometrical profile and of
mitigating the effects of outliers, respectively. The
distance between a cluster pair, u and v, is defined by
the closest pair of representative points, p and q, one
from each of the clusters:
$$\mathrm{dist}(u, v) = \min_{p \in u.\mathrm{rep},\; q \in v.\mathrm{rep}} \mathrm{dist}(p, q). \tag{2}$$
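The merging procedure and equation (2) can be illustrated with a minimal Python sketch. It is not the authors' library: the function names (`representatives`, `cluster_distance`, `cure`), the default values of `c` and `alpha`, the Euclidean point distance, and the exhaustive closest-pair search are all assumptions made for illustration. The original CURE method relies on a heap and a k-d tree instead of the exhaustive search used here.

```python
import math


def euclidean(p, q):
    """Euclidean (L2) distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def representatives(points, c, alpha, dist=euclidean):
    """Pick up to c well-scattered points, then shrink them towards the centroid."""
    center = centroid(points)
    # Farthest-point heuristic: start with the point farthest from the centroid,
    # then repeatedly add the point farthest from those already chosen.
    chosen = [max(points, key=lambda p: dist(p, center))]
    while len(chosen) < min(c, len(points)):
        chosen.append(max(points, key=lambda p: min(dist(p, r) for r in chosen)))
    # Contraction by the fractional factor alpha (0 keeps the point, 1 moves it
    # onto the centroid).
    return [tuple(x + alpha * (m - x) for x, m in zip(r, center)) for r in chosen]


def cluster_distance(u_rep, v_rep, dist=euclidean):
    """Equation (2): distance of the closest pair of representative points."""
    return min(dist(p, q) for p in u_rep for q in v_rep)


def cure(points, k, c=4, alpha=0.5, dist=euclidean):
    """Naive agglomerative loop: merge the closest cluster pair until k remain."""
    clusters = [[tuple(p)] for p in points]
    reps = [representatives(members, c, alpha, dist) for members in clusters]
    while len(clusters) > k:
        # Exhaustive search over all cluster pairs (assumption made for brevity).
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(reps[ab[0]], reps[ab[1]], dist),
        )
        merged = clusters[i] + clusters[j]
        # j > i, so deleting index j first keeps index i valid.
        del clusters[j], reps[j]
        clusters[i] = merged
        reps[i] = representatives(merged, c, alpha, dist)
    return clusters
```

Passing a different `dist` callable, for instance one derived from a similarity function, changes the point distance without touching the merge loop.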
The distance between two points, p and q, often
takes the form of an $L_1$-norm or $L_2$-norm (Euclidean)
metric, but may also be a nonmetric similarity function. Our
cardiac feature vectors mix real, integer and boolean
components, and data points may rather be thought of
as directions (Baeza-Yates and Ribeiro-Neto, 1999) in
the vector space model (Salton et al., 1975). Hence,
we chose the adjusted cosine similarity as our distance
measure; it corresponds to an angle of 0 to 180 degrees
between two zero-mean point vectors and is defined as
$$\mathrm{sim}(p, q) = \frac{(p - p_m) \cdot (q - q_m)}{\|p - p_m\|_2 \, \|q - q_m\|_2}, \tag{3}$$
where $p_m$ and $q_m$ are the means of p and q,
respectively. Adjusted cosine similarity is widely used in
the domain of item-based collaborative filtering.
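Equation (3) can be checked numerically with a short Python sketch. The function names and the handling of constant vectors are assumptions for illustration, not part of the described library. With each vector centered on its own mean, as in equation (3), the value coincides with the Pearson correlation coefficient of the two vectors.

```python
import math


def adjusted_cosine_similarity(p, q):
    """Equation (3): cosine of the angle between the mean-centered p and q."""
    p_m = sum(p) / len(p)  # mean of the components of p
    q_m = sum(q) / len(q)  # mean of the components of q
    pc = [x - p_m for x in p]
    qc = [y - q_m for y in q]
    norm = math.sqrt(sum(x * x for x in pc)) * math.sqrt(sum(y * y for y in qc))
    if norm == 0.0:
        # Assumption: a constant vector has no direction once centered.
        raise ValueError("similarity is undefined for a constant vector")
    return sum(x * y for x, y in zip(pc, qc)) / norm


def angle_degrees(p, q):
    """Angle between the centered vectors, in the range 0 to 180 degrees."""
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    sim = max(-1.0, min(1.0, adjusted_cosine_similarity(p, q)))
    return math.degrees(math.acos(sim))
```

Under this sketch, vectors that rise and fall together yield an angle near 0 degrees, while vectors with opposite trends yield an angle near 180 degrees.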
5 EMPIRICAL EVALUATION
To validate our system in practice, we have implemented
a software library that realizes the cluster
analysis of ECG streams in several stages. After
collecting and cleaning the archived cardiac data, our
library begins by extracting patient, global, period-
and amplitude-based feature vectors. Our features
are regarded as unlabeled and undergo an explicit
clustering process. In addition to detecting the presence
or absence of arrhythmia individually, each of
the constructed groups represents an objective arrhythmia
class, and our goal is to further explore and quantify
the relations between automatically generated
clusters and cardiologist diagnoses.
5.1 Experimental Setup
Our work uses the R programming language (R,
1997) to acquire the raw arrhythmia data and to clean
it up for use in our software environment.
We use the extensive and well-maintained arrhythmia
dataset from the UCI Machine Learning Repository
(UCI, 1987), comprising 452 patient instances with
each ECG trace represented as a vector of 279 feature
elements, and chose to impute missing values,
manifested primarily in the axis orientation columns, with
the mean of the present feature items. For our study,
we selected the time series attributes held in the
12-perspective, period and amplitude cardiac groups, which
total a majority of 264 features, intentionally leaving
patient and global properties outside the scope of this
work. The measured figures of the dynamic signal
were obtained using the ECG system jointly developed
by IBM and Mount Sinai University Hospital.
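The mean imputation step mentioned above can be sketched as follows. The sketch is written in Python rather than the R used in this work, and the function name, the `"?"` missing-value marker, and the treatment of fully missing columns are assumptions for illustration.

```python
def impute_column_means(rows, missing="?"):
    """Replace missing entries with the mean of the present values per column."""
    means = []
    for col in zip(*rows):
        present = [float(v) for v in col if v != missing]
        # A column without any present value has no mean to impute.
        means.append(sum(present) / len(present) if present else None)

    imputed = []
    for row in rows:
        new_row = []
        for value, mean in zip(row, means):
            if value != missing:
                new_row.append(float(value))
            else:
                new_row.append(mean)  # stays None if the whole column is missing
        imputed.append(new_row)
    return imputed
```

Each row stands for one patient instance and each column for one feature, so a missing entry receives the average of that feature over the patients for whom it was recorded.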
As a point of reference, expert cardiologist evaluation