4.2 Random Projection
Random projection is a powerful method for dimensionality reduction (Keogh and Pazzani, 2000). In random projection, the original d-dimensional data is projected to a k-dimensional (k << d) subspace through the origin, using a random k x d matrix R whose columns have unit lengths. In matrix notation, where X is the original set of N d-dimensional observations arranged as a d x N matrix, the algorithm projects the data onto the lower, k-dimensional subspace as X_RP = R X. The idea of random mapping arises from the Johnson-Lindenstrauss lemma (Dasgupta and Gupta, 1999; Dasgupta, 2000): if points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between the points are approximately preserved (Bingham et al., 2001). Random projection has low computational complexity: it consists of forming the random matrix R and projecting the d x N data matrix X into k dimensions, with complexity O(dkN); if the data matrix X is sparse with about c nonzero entries per column, the complexity is of order O(ckN). Our implementation of Random Projection is based on (Bingham et al., 2001).
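To make the projection step concrete, the sketch below shows one possible realization on the GPU using the legacy CuBLAS API. It is only an illustration under the notation above, not the exact code of our implementation: the function names, the use of a uniform distribution in place of Gaussian entries, and the assumption of column-major device buffers dX and dXrp are choices made for this example only.

/*
 * Sketch (not the exact implementation): the projection step
 * X_rp (k x N) = R (k x d) * X (d x N), with R filled with random entries
 * and its columns normalized to unit length, as in (Bingham et al., 2001).
 * Column-major storage and the legacy CuBLAS API are assumed; dX and dXrp
 * are device buffers provided by the caller.
 */
#include <stdlib.h>
#include <math.h>
#include <cublas.h>

/* Fill the k x d matrix R (column-major) with random entries and
 * normalize every column to unit length. */
static void fill_random_matrix(float *R, int k, int d)
{
    for (int j = 0; j < d; j++) {
        float norm = 0.0f;
        for (int i = 0; i < k; i++) {
            /* crude substitute for Gaussian entries; the column is
             * normalized below as in (Bingham et al., 2001) */
            float r = 2.0f * ((float)rand() / RAND_MAX) - 1.0f;
            R[j * k + i] = r;
            norm += r * r;
        }
        norm = sqrtf(norm);
        for (int i = 0; i < k; i++)
            R[j * k + i] /= norm;
    }
}

/* Project dX (d x N, on device) to dXrp (k x N, on device). */
void random_projection(int d, int N, int k, const float *dX, float *dXrp)
{
    float *hR = (float *)malloc((size_t)k * d * sizeof(float));
    float *dR = NULL;
    fill_random_matrix(hR, k, d);

    cublasAlloc(k * d, sizeof(float), (void **)&dR);
    cublasSetMatrix(k, d, sizeof(float), hR, k, dR, k);

    /* Xrp = 1.0 * R * X + 0.0 * Xrp; this multiplication costs O(dkN) */
    cublasSgemm('N', 'N', k, N, d, 1.0f, dR, k, dX, d, 0.0f, dXrp, k);

    cublasFree(dR);
    free(hR);
}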
5 DIMENSIONALITY REDUCTION AND COSINE METRIC IMPLEMENTATIONS IN GPU
This section describes the implementations of both dimensionality reduction algorithms mentioned above and of the cosine metric for sparse and dense vectors.
5.1 GPGPU Hardware Platform
A GPGPU is constructed as a group of multiprocessors, each with multiple cores. The cores share an Instruction Unit with the other cores in their multiprocessor. Multiprocessors have dedicated memories which are much faster than the global memory shared by all multiprocessors. These memories are the read-only constant/texture memory and the shared memory. GPGPU cards are built as massively parallel devices, enabling thousands of parallel threads to run, grouped in blocks with shared memory. This technology provides three key mechanisms to parallelize programs: thread group hierarchy, shared memories, and barrier synchronization. These mechanisms provide fine-grained parallelism nested within coarse-grained task parallelism. Creating optimized code is not trivial, and thorough knowledge of the GPGPU architecture is necessary to do it effectively. The main aspects to consider are the usage of the different memories, the efficient division of the code into parallel threads, and thread synchronization and communication. Synchronization of threads belonging to different blocks is much slower than synchronization within a single block; it should be avoided if it is not necessary and, when it is necessary, it should be solved by running multiple kernels sequentially.
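As an illustration of these mechanisms (and only as an illustration; the kernel below is not part of the implementation described in the following subsections), the following CUDA kernel computes per-block partial dot products using the thread hierarchy, shared memory and barrier synchronization. Combining the per-block results requires a second kernel or a host-side sum, which is the sequential-kernel pattern mentioned above. The names BLOCK_SIZE and partial_dot are arbitrary.

// Illustration only: per-block partial dot products of two vectors.
// The block size is assumed to be a power of two.
#define BLOCK_SIZE 256

__global__ void partial_dot(const float *x, const float *y,
                            float *block_sums, int n)
{
    __shared__ float cache[BLOCK_SIZE];      // fast per-block shared memory

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    // grid-stride loop: each thread accumulates its own partial sum
    for (int i = tid; i < n; i += gridDim.x * blockDim.x)
        sum += x[i] * y[i];

    cache[threadIdx.x] = sum;
    __syncthreads();                         // barrier: all writes visible

    // tree reduction within the block, synchronizing at every step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    // one value per block; blocks cannot synchronize with each other here,
    // so the final sum is formed by a second kernel or on the host
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = cache[0];
}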
5.2 PCA NIPALS and Random Projection Implementation
The PCA/SVD NIPALS algorithm is based on non-linear regression. SVD factorization can also be associated with another matrix transformation, Principal Component Analysis: each left singular vector of the input matrix X, multiplied by the corresponding singular value, equals a single score vector of the PCA method. The non-linear iterative partial least squares (NIPALS) algorithm calculates score vectors (T) and load vectors (P) from the input data (X). The outer product of these vectors can then be subtracted from the input data (X), leaving the residual matrix (R), which can in turn be used to calculate the subsequent principal components. The NIPALS algorithm consists of sequential iterations, each responsible for computing a single orthogonal component. The outer iterations iteratively find projections of the input data onto the principal components which inherit the maximum possible variance from the input (using non-linear regression). Each iteration consists of a few matrix-vector operations (multiplications), and these operations are the main parts of the algorithm that can be parallelized. Our parallel GPU implementation is based on running highly optimized matrix-vector routines from the CuBLAS library. The core of the algorithm is as follows:
// PCA model: X = T*P' + R
// input: X, MxN matrix (data)
// M = number of rows in X
// N = number of columns in X
// K = number of components (K<=N)
// J = number of inner iterations per component
// output: T, MxK scores matrix
// output: P, NxK loads matrix
// output: R, MxN residual matrix
for(k=0; k<K; k++)
{
    // initialize scores column T[:,k] with residual column R[:,k]
    cublasScopy(M, &dR[k*M], 1, &dT[k*M], 1);
    a = 0.0;
    for(j=0; j<J; j++)
    {
        // loads column: P[:,k] = R' * T[:,k]
        cublasSgemv('T', M, N, 1.0, dR, M,
                    &dT[k*M], 1, 0.0, &dP[k*N], 1);