If we want a more accurate estimate of the inner product using Li's method, we can either use a root-finding method to find $a$ where $f(a) = 0$, or use the cubic formula to obtain the root(s) of the degree 3 polynomial directly. The running time of either approach is bounded above by a constant number of operations.
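As an illustration of why this step costs only a constant number of operations, a cubic's roots can be obtained directly with a standard polynomial solver. The coefficients below are placeholders, not the actual coefficients of Li's estimating equation:

```python
import numpy as np

# Placeholder cubic f(a) = c3*a^3 + c2*a^2 + c1*a + c0; the coefficients of
# Li's estimating equation are not reproduced here.
c3, c2, c1, c0 = 1.0, -0.8, 0.15, -0.002

# numpy.roots returns all (possibly complex) roots of the polynomial,
# in constant time for a fixed degree.
roots = np.roots([c3, c2, c1, c0])

# Keep only the (numerically) real roots as candidate estimates.
real_roots = roots[np.abs(roots.imag) < 1e-10].real
print(real_roots)
```

A bracketing root-finder such as `scipy.optimize.brentq` on $f$ would serve the same purpose with a fixed iteration budget.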
3 OUR EXPERIMENTS
Throughout our experiments, we use five different types of random projection matrices, as shown in Table 1. We pick these five types because they are commonly used in practice.
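Table 1 is not reproduced in this section; as a hedged illustration only, common choices such as a dense Gaussian matrix and an Achlioptas-style sparse matrix can be generated as follows (the specific densities are assumptions, not necessarily those in Table 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_rp(p, k):
    # Dense Gaussian projection: entries i.i.d. N(0, 1).
    return rng.standard_normal((p, k))

def sparse_rp(p, k, s=3.0):
    # Achlioptas / Li-style sparse projection: entries are
    # +sqrt(s) w.p. 1/(2s), -sqrt(s) w.p. 1/(2s), and 0 otherwise.
    return rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)],
                      size=(p, k),
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

R_dense = gaussian_rp(5000, 100)   # e.g. r_ij ~ N(0,1), as used for R_1 below
R_sparse = sparse_rp(5000, 100)    # a sparse alternative
```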
We use $N(0, 1)$ to denote the Normal distribution with mean $\mu = 0$ and variance $\sigma^2 = 1$. We denote by $(1)_p$ the length-$p$ vector with all entries being 1, and by $(0)_p$ the length-$p$ vector with all entries being 0. We denote the baseline estimates to be the respective estimates given by using the type of random projection matrix $R_i$.
We run our simulations for 10000 iterations for
every experiment.
3.1 Generating Vectors from Synthetic Data
We first perform our experiments on a wide range of synthetic data. We look at normalized pairs of vectors $x_1, x_2 \in \mathbb{R}^{5000}$ generated from the distributions in Table 2. In short, we look at data that can be Normal, heavy-tailed (Cauchy), sparse (Bernoulli), and an adversarial scenario where the inner product is zero.
We look at the plots of the ratio $\rho$ defined by
$$\rho = \frac{\text{Variance using control variate with } R_i}{\text{Variance using baseline with } R_i} \qquad (35)$$
in Figure 1 for the Euclidean distance. $\rho$ is a measure of the reduction in variance from using RPCV with the matrix $R_i$ rather than just using $R_i$ alone. For this ratio, a value less than 1 means RPCV performs better than the baseline.
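As a hedged illustration of how $\rho$ can be estimated empirically, the sketch below compares a plain Gaussian random projection estimate of the squared Euclidean distance with a control-variate-adjusted version whose control is the projected marginal norms (whose exact expectation is known for normalized vectors). This is a generic control-variate construction for illustration, not necessarily the exact RPCV estimator of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

p, k, n_iter = 5000, 20, 10000   # reduce p or n_iter for a quicker run
x1 = rng.standard_normal(p); x1 /= np.linalg.norm(x1)
x2 = rng.standard_normal(p); x2 /= np.linalg.norm(x2)

baseline = np.empty(n_iter)
control = np.empty(n_iter)
for t in range(n_iter):
    R = rng.standard_normal((p, k)) / np.sqrt(k)   # r_ij ~ N(0,1), scaled by 1/sqrt(k)
    v1, v2 = x1 @ R, x2 @ R
    baseline[t] = np.sum((v1 - v2) ** 2)           # unbiased estimate of ||x1 - x2||^2
    # Control variate: projected marginal norms, whose expectation is
    # ||x1||^2 + ||x2||^2 = 2 for normalized vectors.
    control[t] = np.sum(v1 ** 2) + np.sum(v2 ** 2)

# Optimal control-variate coefficient estimated from the samples.
cov_mat = np.cov(baseline, control)
c = -cov_mat[0, 1] / cov_mat[1, 1]
cv_estimates = baseline + c * (control - 2.0)

rho = np.var(cv_estimates) / np.var(baseline)      # the ratio in (35)
print(rho)
```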
For all pairs $x_i, x_j$ except Cauchy, the reduction of variance of the estimates of the Euclidean distance using the different $R_i$s with RPCV converges quickly to around the same ratio. However, when the data is heavy-tailed, the choice of random projection matrix $R_i$ used with RPCV affects the reduction of variance in the estimates of the Euclidean distance, and sparse matrices $R_i$ give a greater variance reduction for these estimates.
We next look at the estimates of the inner product. In our experiments, we use the method of Li et al., 2006a as the baseline for computing the estimates of the inner product. Our rationale is that both Li's method and our method store the marginal norms of $X$, so comparing against Li's method is the fair comparison. The ratio of variance reduction is shown in Figure 2.
As the number of columns $k$ of the random projection matrix $R$ increases, the variance reduction in our estimate of the inner product decreases, but then increases again up to a ratio just below 1. Since Li's method uses an asymptotic maximum likelihood estimate of the inner product, its estimate becomes more accurate as the number of columns of $R$ increases.
Thus, it is reasonable to use RPCV for Euclidean distances, and Li's method for inner products.
3.2 Estimating the Euclidean Distance of Vectors with Real Data Sets
We now demonstrate RPCV on two datasets, the
colon dataset from Alon et al., 1999 and the kos
dataset from Lichman, 2013.
The colon dataset is an example of a dense dataset, consisting of 62 gene expression levels with 2000 features; thus we have $x_i \in \mathbb{R}^{2000}$, $1 \le i \le 62$.
The kos dataset is an example of a sparse dataset, consisting of 3430 documents and 6906 words from the KOS blog entries; thus we have $x_i \in \mathbb{R}^{3430}$, $1 \le i \le 6906$.
We normalize each dataset such that every observation satisfies $\|x_i\|_2^2 = 1$.
For each dataset, we consider the pairwise Euclidean distances of all observations $\{x_i, x_j\}$, $\forall\, i \neq j$, and compute the estimates of the Euclidean distance with RPCV for the pairs $\{x_i, x_j\}$ that give the 20th, 30th, ..., 90th percentiles of Euclidean distances.
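A minimal sketch of this preprocessing and pair-selection step, assuming the data is held in a rows-as-observations matrix `X` (the variable and function names are illustrative, not from the paper):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pairs_at_percentiles(X, percentiles=(20, 30, 40, 50, 60, 70, 80, 90)):
    # Normalize every observation so that ||x_i||_2^2 = 1.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)

    # All pairwise Euclidean distances, i != j.
    dists = pdist(X)              # condensed vector of n*(n-1)/2 distances
    square = squareform(dists)    # full symmetric distance matrix

    selected = {}
    for q in percentiles:
        target = np.percentile(dists, q)
        # Pair whose distance is closest to the q-th percentile distance.
        i, j = np.unravel_index(np.argmin(np.abs(square - target)), square.shape)
        selected[q] = (i, j)
    return X, selected
```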
We pick a pair in the 50th percentile for both the colon and kos datasets (Figure 3 and Figure 4), and show that for every different $R_i$, the bias quickly converges to zero, and that the variance reductions for the $R_i$s fall in around the same range. Since the bias converges to zero, our control variates work, i.e., we do not get extremely biased estimates with lower variance.
We now look at the variance reduction for pairs from the 20th to 90th percentile of Euclidean distances from both datasets for $R_1$ (where $r_{ij} \sim N(0,1)$). This is shown in Figure 5. We omit plots of the biases, as well as plots of $\rho$ for the other random matrices $R_2$ to $R_5$, since the variance reduction follows a similar trend.