REFERENCES
Agresti, A. and Coull, B. A. (1998). Approximate Is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician.
Boothroyd, G. (2005). Assembly Automation and Product Design. CRC Press, 2nd edition.
Brochu, E., Cora, V. M., and de Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR.
Brown, L. D., Cai, T. T., and Dasgupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science.
Härdle, W., Werwatz, A., Müller, M., and Sperlich, S. (2004). Nonparametric and Semiparametric Models. Springer Berlin Heidelberg.
Laursen, J., Sorensen, L., Schultz, U., Ellekilde, L.-P., and Kraft, D. (2018). Adapting parameterized motions using iterative learning and online collision detection. Pages 7587–7594.
Mathiesen, S., Sørensen, L. C., Kraft, D., and Ellekilde, L.-P. (2018). Optimisation of trap design for vibratory bowl feeders. Pages 3467–3474.
Rasmussen, C. and Williams, C. (2006). Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA.
Ross, S. M. (2009). Introduction to Probability and Statistics for Engineers and Scientists. Academic Press, 4th edition.
Sørensen, L. C., Buch, J. P., Petersen, H. G., and Kraft, D. (2016). Online action learning using kernel density estimation for quick discovery of good parameters for peg-in-hole insertion. In Proceedings of the 13th International Conference on Informatics in Control, Automation and Robotics.
Tesch, M., Schneider, J. G., and Choset, H. (2013). Expensive function optimization with stochastic binary outcomes. In Proceedings of the 30th International Conference on Machine Learning (ICML).
APPENDIX
The Effect of the Bias and Variance Error in Relation to KDE and WSKDE
The true confidence interval consists of both a bias and a variance error; however, the bias term has to be neglected to make the confidence interval calculable (see (9)). The variance term includes $f(x)$, which can be approximated by $\hat{f}(x)$, but the bias term unfortunately also includes $m'(x)$, $m''(x)$, and $f'(x)$, which cannot be approximated properly. Note that the bias and variance errors can be suppressed by letting $h \to 0$ and $nh \to \infty$, respectively.
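For orientation, the standard first-order asymptotic expressions for the Nadaraya–Watson estimator (Härdle et al., 2004) are sketched below; here $\mu_2(K) = \int u^2 K(u)\,du$ and $\sigma^2(x)$ is the conditional outcome variance, which equals $m(x)(1 - m(x))$ for Bernoulli trials:
\[
\mathrm{Bias}\{\hat{m}_h(x)\} \approx \frac{h^2}{2}\,\mu_2(K)\left(m''(x) + 2\,\frac{m'(x)\,f'(x)}{f(x)}\right), \qquad
\mathrm{Var}\{\hat{m}_h(x)\} \approx \frac{\sigma^2(x)\,\|K\|_2^2}{n h\, f(x)}.
\]
This makes the statements above explicit: the bias vanishes as $h \to 0$, the variance vanishes as $nh \to \infty$, and only the $f(x)$ appearing in the variance term can readily be replaced by its estimate $\hat{f}(x)$.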
In general, the bias is the vertical difference between the estimate and the true function and arises from the smoothing effect. This smoothing effect drags down maxima and pulls up minima of the function estimate, $\hat{m}(x)$, compared to $m(x)$. In addition, at extrema the bias is proportional only to $m''(x)$. Hence, neglecting the bias error and assuming that the smoothing does not displace the optimum with respect to $x$, we obtain $\hat{x}_{\mathrm{opt}} = x_{\mathrm{opt}}$ even though $\max(\hat{m}(x)) < \max(m(x))$. This assumption requires that important function details are not smoothed out, which is acceptable when $h$ is chosen appropriately. Furthermore, neglecting the bias error will offset the confidence interval estimate compared to the true confidence interval, such that the estimated bounds are raised at minima and lowered at maxima. For further details see (Härdle et al., 2004).
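As a purely illustrative sketch (not part of the method in this paper), the following Python snippet uses a hypothetical success-rate function m(x) and a Gaussian-kernel Nadaraya–Watson estimate to show numerically that smoothing drags down the maximum while barely displacing its location:

import numpy as np

def nadaraya_watson(x_eval, x_obs, y_obs, h):
    # Gaussian-kernel Nadaraya-Watson regression estimate at the points x_eval.
    w = np.exp(-0.5 * ((x_eval[:, None] - x_obs[None, :]) / h) ** 2)
    return (w @ y_obs) / w.sum(axis=1)

rng = np.random.default_rng(0)
m = lambda x: 0.2 + 0.7 * np.exp(-((x - 0.5) ** 2) / 0.02)  # hypothetical true success rate
x_obs = rng.uniform(0, 1, 2000)
y_obs = rng.binomial(1, m(x_obs)).astype(float)             # Bernoulli outcomes

x_eval = np.linspace(0, 1, 201)
m_hat = nadaraya_watson(x_eval, x_obs, y_obs, h=0.05)

# The estimated maximum is lower than the true maximum, but its x-location stays close.
print("true max:     ", m(x_eval).max(), "at x =", x_eval[np.argmax(m(x_eval))])
print("estimated max:", m_hat.max(), "at x =", x_eval[np.argmax(m_hat)])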
Neglecting the KDE regression bias error will also be reflected in the WSKDE mean and confidence interval estimates, since the KDE regression mean, $\hat{m}_h$, directly replaces the Normal Approximation mean, $\hat{p}_{na}$, as shown in (20). However, the bias error will be suppressed in sparsely sampled regions due to the few-samples correction of WS (the WS confidence interval goes towards $[0\,;1]$ with a mean of 0.5 when $n \to 0$). Regardless of the neglected KDE regression bias error, our derivation of WSKDE is still valid, since it is based only on a comparison of the variance terms of WS and KDE.
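The few-samples behaviour of WS mentioned above can be illustrated with a small sketch; the function below implements the textbook Wilson score interval (not the combined WSKDE estimator of (20)) and shows the interval widening towards [0 ; 1] around a centre of 0.5 as n shrinks:

import math

def wilson_score_interval(successes, n, z=1.96):
    # Textbook Wilson score interval for a binomial proportion.
    if n == 0:
        return 0.5, (0.0, 1.0)  # limiting behaviour referred to in the text
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return centre, (max(0.0, centre - half), min(1.0, centre + half))

for n in (100, 10, 2, 1, 0):
    s = round(0.8 * n)  # keep the observed proportion near 0.8 where possible
    print(n, wilson_score_interval(s, n))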
Generalization to Multiple Dimensions
The equations of KDE and WSKDE can be generalized to multiple dimensions. Hence, the kernel, $K$, becomes a multi-dimensional kernel with bandwidth matrix $H$, which must be symmetric and positive definite. Whenever the bandwidth, $h$, is used as a scalar, as in (9) or (20), it is replaced by the determinant of the bandwidth matrix, $|H|$. For a multivariate Gaussian kernel, $\|K\|_2^2$ is calculated as $1/(2^d \sqrt{\pi^d})$, where $d$ is the number of dimensions; this constant scalar is therefore not dependent on the bandwidth of the kernel. Note that the discrete function estimators NA and WS do not change when going to multiple dimensions, since they relate only to a single parameter set, without the influence of experiments made in neighboring regions as when using kernel smoothing.
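As a small illustrative check (not taken from the paper), the constant $\|K\|_2^2 = 1/(2^d \sqrt{\pi^d})$ for the standard multivariate Gaussian kernel can be verified numerically; the Monte Carlo estimate below exploits that $\int K(u)^2\,du = E_{u \sim K}[K(u)]$:

import numpy as np

def gaussian_kernel_norm_sq(d):
    # Closed form ||K||_2^2 = 1 / (2^d * sqrt(pi^d)) for the standard Gaussian kernel.
    return 1.0 / (2 ** d * np.sqrt(np.pi ** d))

def monte_carlo_check(d, n=200_000, seed=0):
    # Estimate int K(u)^2 du = E_{u~K}[K(u)] by sampling u from K itself.
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n, d))
    k_u = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u * u, axis=1))
    return k_u.mean()

for d in (1, 2, 3):
    print(d, gaussian_kernel_norm_sq(d), monte_carlo_check(d))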