Threads are grouped into blocks, which are part of the block grid. The number of threads in a block is typically higher than the number of cores in a multiprocessor, so the threads of a block are divided into smaller batches, called warps. The computation of one block is independent of that of other blocks, and each block is scheduled to one multiprocessor, where several warps may be active at the same time. CUDA employs a single-instruction multiple-thread execution model (Lindholm et al., 2008), in which all threads of a warp execute the same instruction at any given time.
GPUs have several memories, depending on the model. Accessible to all cores are the global memory, constant memory, and texture memory, of which the last two are read-only. Threads within a block also share a smaller but significantly faster memory, called shared memory, which resides inside the multiprocessor. Finally, each thread has access to local memory, which resides in the global memory space and thus has the same latency as global memory for read and write operations.
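To make this model concrete, the following minimal Numba CUDA kernel (an illustrative sketch, not code from our implementation) shows how a thread derives its global index from the block grid and how a block-local buffer is placed in shared memory:

import numpy as np
from numba import cuda, float32

@cuda.jit
def scale_kernel(data, factor):
    # Block-local buffer in fast shared memory, visible only to the
    # threads of this block. Its size must be a compile-time constant.
    buf = cuda.shared.array(shape=128, dtype=float32)
    # Global thread index: block index * block size + thread index.
    i = cuda.grid(1)
    if i < data.size:
        # Stage the value in shared memory (illustration of the API
        # only; no inter-thread sharing actually happens here).
        buf[cuda.threadIdx.x] = data[i]
        cuda.syncthreads()              # wait for the whole block
        data[i] = buf[cuda.threadIdx.x] * factor  # write to global memory

data = np.arange(1024, dtype=np.float32)
d_data = cuda.to_device(data)           # host-to-device transfer
scale_kernel[8, 128](d_data, 2.0)       # grid of 8 blocks, 128 threads each
result = d_data.copy_to_host()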
2.2 Parallel K-Means
Several parallel implementations of K-Means for the GPU exist in the literature (Zechner and Granitzer, 2009; Sirotković et al., 2012), and all report speed-ups relative to their sequential counterparts. Our implementation was programmed in Python using the Numba CUDA API, following the version of Zechner and Granitzer (2009), with the difference that we do not use shared memory. We parallelize the labeling step, a choice further motivated by empirical data suggesting that, on average, 98% of the execution time is spent in that step. Our implementation starts by transferring the data (pattern set and initial centroids) to the GPU and allocating space in the GPU for the labels and for the distances from patterns to their closest centroids. Initial centroids are chosen randomly from the dataset, and the dataset is transferred only once.
As shown in Algorithm 1, each GPU thread computes the closest centroid to each point of its corresponding set $X_{ThreadID}$ of 2 data points (2 points per thread proved to yield the highest speed-up). The labels and corresponding distances are stored in arrays and sent back to the host afterwards for the computation of the new centroids. The centroid recomputation, performed on the CPU, starts by counting the number of patterns attributed to each centroid. Afterwards, it checks whether there are any "empty" clusters. Dealing with empty clusters is important because the intended use expects the number of output clusters to be the same as specified by the input parameter. Centroids corresponding to empty clusters are replaced by the patterns that are furthest away from their centroids. Any other centroid $c_j$ is the mean of the patterns that were labeled $j$. In our implementation several parameters can be adjusted by the user, such as the grid topology and the number of patterns processed by each thread.
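A minimal NumPy sketch of this CPU-side recomputation, including the empty-cluster re-seeding described above (function and variable names are our own, for illustration):

import numpy as np

def recompute_centroids(data, labels, distances, k):
    # Count the number of patterns attributed to each centroid.
    counts = np.bincount(labels, minlength=k)
    centroids = np.empty((k, data.shape[1]))
    # Patterns sorted by distance to their centroid, furthest first,
    # used to re-seed any empty clusters.
    furthest = np.argsort(distances)[::-1]
    next_seed = 0
    for j in range(k):
        if counts[j] == 0:
            # Empty cluster: re-seed with the pattern furthest
            # away from its current centroid.
            centroids[j] = data[furthest[next_seed]]
            next_seed += 1
        else:
            # Non-empty cluster: mean of the patterns labeled j.
            centroids[j] = data[labels == j].mean(axis=0)
    return centroids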
Input: dataset, centroids $c_j$, ThreadID
Output: labels, distances
forall $x_i \in X_{ThreadID}$ do
    $labels_i \leftarrow \arg\min_j D(x_i, c_j)$;
    $distances_i \leftarrow D(x_i, c_{labels_i})$;
end
Algorithm 1: GPU kernel for the labeling phase.
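In Numba, the kernel of Algorithm 1 can be sketched as follows (a simplified version assuming squared Euclidean distance and 2 points per thread; names, sizes, and launch parameters are illustrative, not our exact implementation):

import numpy as np
from numba import cuda

@cuda.jit
def labeling_kernel(data, centroids, labels, distances):
    # Each thread labels the 2 consecutive patterns of its set X_ThreadID.
    tid = cuda.grid(1)
    n, dim = data.shape
    k = centroids.shape[0]
    for i in range(2 * tid, min(2 * tid + 2, n)):
        best_j = 0
        best_dist = 1e308  # effectively +infinity for float64
        for j in range(k):
            # Squared Euclidean distance between pattern i and centroid j.
            dist = 0.0
            for f in range(dim):
                diff = data[i, f] - centroids[j, f]
                dist += diff * diff
            if dist < best_dist:
                best_dist = dist
                best_j = j
        labels[i] = best_j
        distances[i] = best_dist

# Host side: transfer the data once, launch, and copy the results back.
n, dim, k = 10000, 2, 50                 # illustrative sizes
data = np.random.rand(n, dim)
centroids = data[np.random.choice(n, k, replace=False)]
d_data = cuda.to_device(data)
d_centroids = cuda.to_device(centroids)
d_labels = cuda.device_array(n, dtype=np.int32)
d_distances = cuda.device_array(n, dtype=np.float64)
threads_per_block = 128
blocks = ((n + 1) // 2 + threads_per_block - 1) // threads_per_block
labeling_kernel[blocks, threads_per_block](d_data, d_centroids,
                                           d_labels, d_distances)
labels = d_labels.copy_to_host()
distances = d_distances.copy_to_host()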
3 COMBINATION
Space complexity is the main challenge in building the co-association matrix. A complete pair-wise $n \times n$ co-association matrix has $O(n^2)$ space complexity, but this can be reduced to $O\big(\frac{n(n-1)}{2}\big)$ without loss of information, due to its symmetry.
to its symmetry. Still, these complexities are rather
high when large datasets are contemplated, since it
becomes infeasible to fit these co-association matri-
ces in the main memory of a typical workstation.
Lourenço et al. (2010) approach the problem by exploiting the sparse nature of the co-association matrix, but do not cover the efficiency of building a sparse matrix or the overhead associated with sparse data structures. The effort of the present work was focused on further exploiting the sparse nature of EAC, building on previous insights and exploring the topics that the literature has neglected so far.
Building a non-sparse matrix is easy and fast, since the memory for the whole matrix is allocated at once and indexing the matrix is direct. When using sparse matrices, neither is true. In the specific case of EAC, there is no way to know in advance how many associations the co-association matrix will have, which means it is not possible to pre-allocate the memory to the correct size of the matrix. This translates into allocating the memory gradually, which may result in fragmentation, depending on the implementation, and requires more complex data structures, which incur significant computational overhead. For building a matrix, the DOK (Dictionary of Keys) and LIL (List of Lists) formats are recommended in the documentation of the SciPy (Jones et al., 2001) scientific computing library. These were briefly tested on simple EAC problems: not only was their execution time several orders of magnitude higher than that of a traditional fully allocated matrix, but the overhead of the sparse data structures also resulted in a space complexity higher than that of the fully allocated matrix.
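For reference, a minimal sketch of how a co-association matrix could be accumulated in SciPy's DOK format (names are illustrative; the per-element updates below are precisely the kind of overhead discussed above):

import numpy as np
from scipy.sparse import dok_matrix

def update_coassoc(coassoc, labels):
    # Add one association for every pair of patterns that falls in
    # the same cluster of this partition; only the upper triangle is
    # stored, exploiting the symmetry of the matrix.
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                coassoc[members[a], members[b]] += 1

n = 1000                                    # illustrative dataset size
coassoc = dok_matrix((n, n), dtype=np.int32)
# for labels in ensemble_partitions:        # one call per partition
#     update_coassoc(coassoc, labels)

Each `+= 1` above is a Python-level dictionary lookup and insertion, which is consistent with the execution times observed in our tests.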