On the Capability of Neural Networks to Approximate the Neyman-Pearson Detector. A Theoretical Study

P. Jarabo-Amores*, R. Gil-Pita, M. Rosa-Zurera, and F. López-Ferreras

Departamento de Teoría de la Señal y Comunicaciones,
Escuela Politécnica Superior, Universidad de Alcalá,
Ctra. Madrid-Barcelona, km. 33.600, 28805, Alcalá de Henares - Madrid (SPAIN)
Abstract. In this paper, the application of neural networks to approximating the Neyman-Pearson detector is considered. We propose a strategy to identify the training parameters that can be controlled to reduce the effect of approximation errors on the performance of the neural network based detector. The function approximated by a neural network trained under the mean squared-error criterion is deduced, without imposing any restriction on the prior probabilities of the classes or on the desired outputs selected for training, proving that these parameters play an important role in controlling the sensitivity of the neural network detector performance to approximation errors. Another important parameter is the signal-to-noise ratio selected for training. The proposed strategy allows its best value to be determined when the statistical properties of the feature vectors are known. As an example, the detection of Gaussian signals in Gaussian interference is considered.
1 Introduction
The objective of this paper is to study the capability of neural networks to approximate a Neyman-Pearson detector. This detector maximizes the probability of detection ($P_D$), while maintaining the probability of false alarm ($P_{FA}$) lower than or equal to a specified value. The characteristics of such a detector are reflected in its ROC (Receiver Operating Characteristic) curve, which relates $P_D$ to $P_{FA}$ [1].
Ruck et al. [2] and Wan [3] demonstrated that a neural network can be used to approximate the optimum Bayesian classifier when trained using the mean squared-error criterion.
In previous works, neural networks have been proposed for approximating the Neyman-Pearson detector in different environments [4][5]. These works highlighted the strong dependence of the neural network-based detector performance on the signal-to-noise ratio selected for training (TSNR). They also observed that the difference between the performance of the neural detector and that of the Neyman-Pearson detector depends on the desired $P_{FA}$, and hence on the corresponding detection threshold.
Recently, some attempts to reduce the dependence of the neural detector performance on TSNR have been carried out [6], based on the use of a complex pre-processing stage that reduces this dependence at the expense of a high computational cost. Nevertheless, no effort has been made to explain the reasons for such a dependence, knowledge of which could help to easily select the best TSNR and training strategy for designing neural network-based detectors that approximate the Neyman-Pearson detector.

This paper deals with the theoretical explanation of the effects of approximation errors on the performance of a neural network based detector that approximates the Neyman-Pearson detector.

* This work has been supported by the "Consejería de Educación de la Comunidad de Madrid" (SPAIN), under Project 07T/0036/2003.

Jarabo-Amores P., Gil-Pita R., Rosa-Zurera M. and López-Ferreras F. (2004). On the Capability of Neural Networks to Approximate the Neyman-Pearson Detector - A Theoretical Study. In Proceedings of the First International Workshop on Artificial Neural Networks: Data Preparation Techniques and Application Development, pages 67-74. DOI: 10.5220/0001150100670074. Copyright © SciTePress.
2 Problem Formulation
The performance of a detector that approximates the Neyman-Pearson detector must be evaluated from the difference between its ROC curve and that of the Neyman-Pearson detector. For a given $P_{FA}$, the difference between the probabilities of detection must be as small as possible. The decrease in $P_D$ that is observed in the ROC curve for a given $P_{FA}$ is expressed in (1):

\[ \Delta P_D = \frac{\partial P_D}{\partial P_{FA}} \frac{\partial P_{FA}}{\partial \eta} \Delta\eta \qquad (1) \]
Practical $P_{FA}$ values are below $10^{-6}$, while practical $P_D$ values can be higher than 0.8, so, in practical conditions, ROC curves have high positive slopes for low values of $P_{FA}$, and low positive slopes for high values of $P_{FA}$. Besides, the function that relates $P_{FA}$ to the detection threshold, $\eta$, has a negative slope. Taking this characteristic into consideration, together with the fact that the ROC curve of the Neyman-Pearson detector is a characteristic that cannot be modified, the following conclusions can be extracted:
- In the low $P_{FA}$ region, $\partial P_D / \partial P_{FA}$ is usually very high. For the reduction of $P_D$ to be small, the magnitude of $(\partial P_{FA}/\partial\eta)\Delta\eta$ must be very low.
- In the high $P_{FA}$ region, $\partial P_D / \partial P_{FA}$ is usually very low, so a small reduction in $P_D$ can be guaranteed even if the magnitude of $(\partial P_{FA}/\partial\eta)\Delta\eta$ is large.
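These two regimes can be checked numerically. The following sketch is an illustration, not taken from the paper: it uses the classical shift-in-mean Gaussian detector, for which $P_D = Q(Q^{-1}(P_{FA}) - d)$, with $Q$ the Gaussian tail function and $d$ an assumed deflection, so the ROC slope is $\partial P_D / \partial P_{FA} = \phi(Q^{-1}(P_{FA}) - d) / \phi(Q^{-1}(P_{FA}))$.

```python
import math

def Q(x):
    """Gaussian right-tail probability."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def Q_inv(p, lo=-10.0, hi=10.0):
    """Invert Q by bisection (Q is decreasing in x)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Q(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def phi(x):
    """Standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def roc_slope(pfa, d=3.0):
    """dPD/dPFA for the shift-in-mean Gaussian detector (d is an assumed deflection)."""
    x = Q_inv(pfa)
    return phi(x - d) / phi(x)

print(roc_slope(1e-6))  # low-PFA region: steep ROC
print(roc_slope(0.5))   # high-PFA region: shallow ROC
```

With $d = 3$ the slope at $P_{FA} = 10^{-6}$ is several orders of magnitude larger than at $P_{FA} = 0.5$, matching the two bullet points above.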
The Neyman-Pearson detector decision rule is the result of comparing the likelihood ratio, or any other equivalent statistic, to the detection threshold, $\eta$. For a desired $P_{FA}$, this threshold depends on the expression of the selected statistic. In order to study the decrease in $P_D$ due to approximation errors, expression (1) must be calculated. Taking this study as a starting point, it is possible to identify the design parameters that minimize expression (1), independently of the mean squared-error minimization strategy selected.
3 Expression of the approximated discriminant function
D. W. Ruck et al. [2] demonstrated that a multilayer perceptron (MLP) converges to a mean squared-error approximation of the Bayes optimal discriminant function, when trained using the mean squared-error criterion. They studied the two-class and multiclass problems, and extended this result to any mean squared-error minimization technique.

For binary detection, they studied an MLP with only one neuron in the output layer. The network was trained to produce 1 when the feature vector was from class $H_1$ and $-1$ when the vector was from class $H_0$. They proved that the neural network output approximates the Bayes optimal discriminant function $g_0(z)$, given in (2), where $z$ is the feature vector, and $P(H_1|z)$ and $P(H_0|z)$ are the a posteriori probabilities of the classes.
\[ g_0(z) = P(H_1|z) - P(H_0|z) \qquad (2) \]
The mean squared-error between the network output, $F(z, W)$, for a given set of weights, $W$, and the desired outputs, is given by (3). $E_s(W)$ is the sample mean error calculated for a set of $n$ pre-classified feature vectors, and $Z_1$ and $Z_0$ are the sets of all possible feature vectors for classes $H_1$ and $H_0$, respectively ($Z_0 \cup Z_1 = Z$, $Z_0 \cap Z_1 = \emptyset$, $Z$ being the input space).
\[ E_m(W) = \lim_{n\to\infty} \frac{E_s(W)}{n} = \lim_{n\to\infty} \frac{1}{n} \Big[ \sum_{z \in Z_1} (F(z, W) - 1)^2 + \sum_{z \in Z_0} (F(z, W) + 1)^2 \Big] \qquad (3) \]
Using the Strong Law of Large Numbers, expression (3) can be rewritten as (4). Finally, applying the Bayes formula and rearranging terms, (4) becomes (5).
\[ E_m(W) = P(H_1) \int_Z (F(z, W) - 1)^2 f(z|H_1)\,dz + P(H_0) \int_Z (F(z, W) + 1)^2 f(z|H_0)\,dz \qquad (4) \]
\[ E_m(W) = \int_Z (F(z, W) - g_0(z))^2 f(z)\,dz + \Big[ 1 - \int_Z g_0^2(z) f(z)\,dz \Big] \qquad (5) \]
If the training set represents a reasonable approximation to the input space, then although the network is trained to minimize $E_s(W)$, $E_m(W)$ will be minimized. Since the term in brackets in expression (5) is independent of $W$, minimizing $E_m(W)$ is equivalent to minimizing (6). So the network output is an approximation of the Bayes optimal discriminant function in the mean squared-error sense.
\[ E(W) = \int_Z (F(z, W) - g_0(z))^2 f(z)\,dz \qquad (6) \]
In a more general problem, if the network is trained to produce $t_{H_1}$ when the feature vector is from class $H_1$ and $t_{H_0}$ when the feature vector is from class $H_0$, expression (4) becomes (7):
\[ E_m(W) = P(H_1) \int_Z (F(z, W) - t_{H_1})^2 f(z|H_1)\,dz + P(H_0) \int_Z (F(z, W) - t_{H_0})^2 f(z|H_0)\,dz \qquad (7) \]
In this case, the network output is a mean squared-error approximation of the function $f_0(z)$ defined in (8).
\[ f_0(z) = \frac{P(H_1) f(z|H_1) t_{H_1} + P(H_0) f(z|H_0) t_{H_0}}{P(H_1) f(z|H_1) + P(H_0) f(z|H_0)} \qquad (8) \]
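Expression (8) is the conditional mean $E[t\,|\,z]$ of the desired output, which is exactly what a mean squared-error regressor converges to. A small numerical sketch, using a hypothetical scalar example with assumed priors, targets and variances (not from the paper), checks this by comparing the binned sample mean of the targets with the analytic $f_0(z)$:

```python
import math, random

random.seed(0)

P1, P0 = 0.5, 0.5          # priors used to build the training set (assumed)
t1, t0 = 1.0, -1.0         # desired outputs (assumed)
var1 = 3.0                 # variance under H1 (assumed); unit variance under H0

def f0(z):
    """Analytic f0(z) from (8) for scalar zero-mean Gaussian likelihoods."""
    l1 = math.exp(-z * z / (2 * var1)) / math.sqrt(2 * math.pi * var1)
    l0 = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    return (P1 * l1 * t1 + P0 * l0 * t0) / (P1 * l1 + P0 * l0)

# Empirical check: the sample mean of the targets in a narrow bin around z
# estimates E[t | z], which f0(z) should match.
samples = []
for _ in range(200000):
    if random.random() < P1:
        samples.append((random.gauss(0.0, math.sqrt(var1)), t1))
    else:
        samples.append((random.gauss(0.0, 1.0), t0))

vals = [t for z, t in samples if 0.9 < z < 1.1]   # narrow bin around z = 1
emp = sum(vals) / len(vals)
print(emp, f0(1.0))        # the two values should be close
```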
If $\eta_{net}$ is the detection threshold for a given $P_{FA}$, the decision rule approximated by the neural network is given by (9).
\[ \frac{P(H_1) f(z|H_1) t_{H_1} + P(H_0) f(z|H_0) t_{H_0}}{P(H_1) f(z|H_1) + P(H_0) f(z|H_0)} \; \underset{H_0}{\overset{H_1}{\gtrless}} \; \eta_{net} \qquad (9) \]
$f_0(z)$ is equal to $g_0(z)$ for $t_{H_1} = 1$ and $t_{H_0} = -1$, and to minimize the probability of misclassification, $\eta_{net}$ must be set to 0. But what Ruck et al. [2] and Wan [3] did not notice is that, since $f_0(z)$ can be expressed as a function of the likelihood ratio, the network is not only approximating the minimum probability of error classifier: it can also approximate the Neyman-Pearson detector if the detection threshold is modified according to probability of false alarm requirements.
Rule (9) shows that, for implementing the Neyman-Pearson detector, the detection threshold for a given $P_{FA}$ is not only a function of the likelihood functions, but also depends on the a priori probabilities and the desired outputs. These are parameters that can be selected by the designer when generating the training set and when determining the activation function of the output neuron.
4 Effect of approximation errors on $P_{FA}$
The neural network will converge to an approximation of (8), so the decision rule implemented can be expressed as in (10), where $\Delta f_0(z, W)$ is the approximation error.
\[ f_0(z) + \Delta f_0(z, W) \; \underset{H_0}{\overset{H_1}{\gtrless}} \; \eta_{net} \qquad (10) \]
This decision rule can also be expressed as (11), revealing that the effect of approximation errors can be studied as the effect of erroneous detection thresholds.
\[ f_0(z) \; \underset{H_0}{\overset{H_1}{\gtrless}} \; \eta_{net} - \Delta f_0(z, W) \qquad (11) \]
In order to evaluate the decrease in $P_D$ due to threshold errors, the partial derivative of $P_{FA}$ with respect to the detection threshold must be calculated. This calculation requires knowledge of the likelihood functions, and can be very tedious due to the complexity of rule (9).
In practice, when designing a Neyman-Pearson detector, the first step consists in determining the likelihood ratio for the problem to be solved, as indicated in expression (12). Before determining the detection threshold for a given $P_{FA}$, a simpler sufficient statistic is calculated by applying a set of simplifications and using monotonic functions.
\[ \Lambda(z) = \frac{f(z|H_1)}{f(z|H_0)} \; \underset{H_0}{\overset{H_1}{\gtrless}} \; \eta_{cv} \qquad (12) \]
Following a similar strategy, in a first step, rule (9) can be expressed as a function of the likelihood ratio, as indicated in expression (13).
\[ \frac{P(H_1) \Lambda(z) t_{H_1} + P(H_0) t_{H_0}}{P(H_1) \Lambda(z) + P(H_0)} \; \underset{H_0}{\overset{H_1}{\gtrless}} \; \eta_{net} \qquad (13) \]
The relation between $\eta_{cv}$ and $\eta_{net}$ is given in (14).
\[ \eta_{cv} = \frac{P(H_0)(\eta_{net} - t_{H_0})}{P(H_1)(t_{H_1} - \eta_{net})} \qquad (14) \]
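For concreteness, (14) and its inverse can be written as two small helper functions (hypothetical names; equal priors and targets $\pm 1$ are assumed defaults):

```python
def eta_cv_from_net(eta_net, p0=0.5, p1=0.5, t1=1.0, t0=-1.0):
    """Likelihood-ratio threshold equivalent to a network-output threshold, from (14)."""
    return p0 * (eta_net - t0) / (p1 * (t1 - eta_net))

def eta_net_from_cv(eta_cv, p0=0.5, p1=0.5, t1=1.0, t0=-1.0):
    """Inverse mapping: the network-output threshold for a likelihood-ratio threshold."""
    return (p1 * eta_cv * t1 + p0 * t0) / (p1 * eta_cv + p0)

# Round trip through (14) and its inverse recovers the original threshold
x = 0.3
print(eta_net_from_cv(eta_cv_from_net(x)))  # ~0.3
```

With equal priors and targets $\pm 1$, $\eta_{net} = 0$ maps to $\eta_{cv} = 1$, the minimum probability of error threshold, consistent with the discussion of rule (9).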
The partial derivative of $P_{FA}$ with respect to $\eta_{net}$ can be calculated using the chain rule (15).
\[ \frac{\partial P_{FA}}{\partial \eta_{net}} = \frac{\partial P_{FA}}{\partial \eta_{cv}} \frac{\partial \eta_{cv}}{\partial \eta_{net}} \qquad (15) \]
The second factor on the right-hand side of (15) can be calculated from (14) to obtain (16). It depends on the a priori probabilities of the classes and the desired outputs selected for training, factors that can be controlled by the designer.
\[ \frac{\partial \eta_{cv}}{\partial \eta_{net}} = \frac{P(H_0)(t_{H_1} - t_{H_0})}{P(H_1)(t_{H_1} - \eta_{net})^2} \qquad (16) \]
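Expression (16) can be sanity-checked by differentiating (14) numerically. A quick sketch, with assumed priors and targets:

```python
def eta_cv(eta_net, p0, p1, t1, t0):
    """Expression (14)."""
    return p0 * (eta_net - t0) / (p1 * (t1 - eta_net))

def d_eta_cv(eta_net, p0, p1, t1, t0):
    """Expression (16): analytic derivative of (14) with respect to eta_net."""
    return p0 * (t1 - t0) / (p1 * (t1 - eta_net) ** 2)

p0, p1, t1, t0 = 0.5, 0.5, 1.0, -1.0   # assumed design values
h = 1e-6
for en in (-0.5, 0.0, 0.9):            # 0.9 is close to t1: the derivative blows up
    num = (eta_cv(en + h, p0, p1, t1, t0) - eta_cv(en - h, p0, p1, t1, t0)) / (2 * h)
    print(en, d_eta_cv(en, p0, p1, t1, t0), num)
```

The growth of the derivative as $\eta_{net}$ approaches $t_{H_1}$ is the low-$P_{FA}$ sensitivity discussed in the conclusions below expression (16).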
From the analysis of expression (16) the following conclusions can be extracted:

- The function approximated by the neural network has been expressed as a function of the likelihood ratio. When this ratio is greater than or equal to $\eta_{cv}$, we decide that hypothesis $H_1$ is true, and if it is lower than $\eta_{cv}$, we decide in favor of hypothesis $H_0$. So $t_{H_1}$ must be greater than $t_{H_0}$.
- $\eta_{net}$ takes values between $t_{H_0}$ and $t_{H_1}$, so $\eta_{net} \leq t_{H_1}$.
- From the previous two points, we can conclude that the partial derivative of $\eta_{cv}$ with respect to $\eta_{net}$ is always positive.
- For $\eta_{net}$ values close to $t_{H_1}$, that is, for very low $P_{FA}$ values, $\partial \eta_{cv} / \partial \eta_{net}$ is very high (for $\eta_{net} = t_{H_1}$ it tends to infinity). We can try to compensate for this to some degree by increasing the difference between the desired outputs during training, or by constructing training sets where feature vectors from hypothesis $H_1$ are more likely than those from hypothesis $H_0$.
The first factor on the right-hand side of (15) depends on the likelihood functions of the problem to be solved. Its value is calculated in (17).
\[ \frac{\partial P_{FA}}{\partial \eta_{cv}} = \frac{\partial}{\partial \eta_{cv}} \Big[ 1 - \int_{-\infty}^{\eta_{cv}} f(\Lambda(z)|H_0)\,d\Lambda(z) \Big] = -f(\Lambda(z)|H_0)\big|_{\Lambda(z|H_0) = \eta_{cv}} \qquad (17) \]
To gain insight into the influence of $\partial P_{FA} / \partial \eta_{cv}$, we follow the strategy of looking for a simpler test statistic. If we denote this new statistic as $z(z)$ and the corresponding detection threshold as $\eta_s$, the decision rule can be expressed as in (18).
\[ z(z) \; \underset{H_0}{\overset{H_1}{\gtrless}} \; \eta_s \qquad (18) \]
The relation between $\eta_s$ and $\eta_{cv}$ is determined by the relation that exists between the likelihood ratio and the selected statistic, so it is known. Expression (15) can be rewritten as a function of $\eta_s$.
\[ \frac{\partial P_{FA}}{\partial \eta_{net}} = \frac{\partial P_{FA}}{\partial \eta_s} \frac{\partial \eta_s}{\partial \eta_{cv}} \frac{\partial \eta_{cv}}{\partial \eta_{net}} \qquad (19) \]
Expression (19) shows that the partial derivative of $P_{FA}$ with respect to $\eta_{net}$ can be expressed as the product of three factors:

- The first factor, $\partial P_{FA} / \partial \eta_s$, is a characteristic of the problem to be solved.
- The second factor, $\partial \eta_s / \partial \eta_{cv}$, is also a characteristic of the problem to be solved.
- The third factor, $\partial \eta_{cv} / \partial \eta_{net}$, does not depend only on the a priori probabilities of the classes and the desired outputs selected for training, because the value of $\eta_{net}$ required for a given $P_{FA}$ depends on the problem to be solved.
The usefulness of adding a new factor in (15) can only be proved if a particular case is considered. In the next section, the problem of detecting Gaussian signals in Gaussian interference is considered.
5 A case study: Detection of Gaussian signals in Gaussian interference
The problem of detecting Gaussian signals in Gaussian interference is considered. The feature vector is composed of $n$ independent Gaussian samples with zero mean and unit variance under hypothesis $H_0$, and zero mean and variance $\sigma_s^2 + 1$ under hypothesis $H_1$. The signal-to-noise ratio is defined in (20), and the value selected for constructing the training set is denoted as $tsnr$.
\[ SNR = 10 \log(snr) = 10 \log(\sigma_s^2) \qquad (20) \]
For a given $tsnr$, the likelihood functions are expressed in (21) and (22); the likelihood ratio and the corresponding detection rule are given by (23).
\[ f(z|H_0) = \frac{1}{\sqrt{(2\pi)^n}} \exp\Big( -\frac{1}{2} z^T z \Big) \qquad (21) \]
\[ f(z|H_1) = \frac{1}{\sqrt{(2\pi)^n (tsnr + 1)^n}} \exp\Big[ -\frac{1}{2(tsnr + 1)} z^T z \Big] \qquad (22) \]
\[ \Lambda(z) = \frac{1}{(tsnr + 1)^{n/2}} \exp\Big[ \frac{tsnr}{2(tsnr + 1)} z^T z \Big] \; \underset{H_0}{\overset{H_1}{\gtrless}} \; \eta_{cv} \qquad (23) \]
A simpler sufficient statistic can be obtained by applying logarithms and rearranging terms:
\[ z(z) = z^T z \; \underset{H_0}{\overset{H_1}{\gtrless}} \; 2\,\frac{tsnr + 1}{tsnr} \ln\big[ \eta_{cv} (tsnr + 1)^{n/2} \big] = \eta_s \qquad (24) \]
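As a sketch of the design chain, the following code (hypothetical helper names) computes $\eta_s$ for a desired $P_{FA}$ using the closed-form chi-square survival function for even $n$, and then inverts (24) to obtain the equivalent likelihood-ratio threshold $\eta_{cv}$:

```python
import math

def chi2_sf(x, n):
    """P(chi2_n > x) for even n, via the closed-form Poisson sum."""
    assert n % 2 == 0
    s, term = 0.0, 1.0
    for k in range(n // 2):
        if k > 0:
            term *= (x / 2) / k
        s += term
    return math.exp(-x / 2) * s

def eta_s_for_pfa(pfa, n, lo=0.0, hi=1e4):
    """Invert the chi-square survival function by bisection (it is decreasing in x)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_sf(mid, n) > pfa:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def eta_cv_from_s(eta_s, tsnr, n):
    """Invert (24): the likelihood-ratio threshold that matches eta_s."""
    return math.exp(eta_s * tsnr / (2 * (tsnr + 1))) / (tsnr + 1) ** (n / 2)

n, pfa, tsnr = 8, 1e-4, 2.0   # assumed example values
es = eta_s_for_pfa(pfa, n)
print(es, eta_cv_from_s(es, tsnr, n))
```

Note that $\eta_s$ comes out of the $H_0$ statistics alone, with no reference to $tsnr$, while $\eta_{cv}$ does depend on it; this is the independence property discussed next.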
As the likelihood function under hypothesis $H_0$ does not depend on $tsnr$, the probability density function of $z(z)$ under $H_0$ does not depend on it either, and for a given $P_{FA}$, $\eta_s$ is independent of $tsnr$. Because of that, the performance of the Neyman-Pearson detector is independent of the $tsnr$ value.
Although the ROC curves and $\partial P_D / \partial P_{FA}$ do not depend on $tsnr$, $\partial P_{FA} / \partial \eta_{net}$ and the sensitivity of the neural detector to approximation errors do depend on it.
The partial derivative of $\eta_s$ with respect to $\eta_{cv}$ is given in (25).
\[ \frac{\partial \eta_s}{\partial \eta_{cv}} = \frac{2(1 + tsnr)}{\eta_{cv}\, tsnr} \qquad (25) \]
$z(z|H_0)$ is a chi-square random variable with $n$ degrees of freedom. The partial derivative of $P_{FA}$ with respect to $\eta_s$ is calculated in (26).
\[ \frac{\partial P_{FA}}{\partial \eta_s} = -\frac{1}{2^{n/2} \left(\frac{n}{2} - 1\right)!}\, \eta_s^{\left(\frac{n}{2} - 1\right)} \exp\Big( -\frac{\eta_s}{2} \Big) \qquad (26) \]
Combining expressions (16), (25) and (26), the partial derivative of $P_{FA}$ with respect to $\eta_{net}$ is calculated in (27).
\[ \frac{\partial P_{FA}}{\partial \eta_{net}} = -\frac{1}{2^{n/2} \left(\frac{n}{2} - 1\right)!}\, \eta_s^{\left(\frac{n}{2} - 1\right)} \exp\Big( -\frac{\eta_s}{2} \Big)\, \frac{2(1 + tsnr)}{\eta_{cv}\, tsnr}\, \frac{P(H_0)(t_{H_1} - t_{H_0})}{P(H_1)(t_{H_1} - \eta_{net})^2} \qquad (27) \]
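To illustrate how the sensitivity in (27) varies with $tsnr$ for a fixed $P_{FA}$, the following sketch evaluates its magnitude over a range of training signal-to-noise ratios. The priors $P(H_0) = P(H_1) = 0.5$, targets $\pm 1$, and an $\eta_s$ value roughly matching $P_{FA} = 10^{-4}$ for $n = 8$ are all assumptions for the example, not values from the paper; the $tsnr$ minimizing this magnitude would be the preferred training value.

```python
import math

def sensitivity(eta_s, tsnr, n, p0=0.5, p1=0.5, t1=1.0, t0=-1.0):
    """|dPFA/d eta_net| from (27), for a fixed eta_s (i.e. a fixed PFA)."""
    # Chi-square pdf at eta_s under H0: first factor of (27), cf. (26)
    pdf = eta_s ** (n / 2 - 1) * math.exp(-eta_s / 2) / (2 ** (n / 2) * math.factorial(n // 2 - 1))
    # Invert (24): likelihood-ratio threshold matching eta_s for this tsnr
    eta_cv = math.exp(eta_s * tsnr / (2 * (tsnr + 1))) / (tsnr + 1) ** (n / 2)
    # Invert (14): network-output threshold matching eta_cv
    eta_net = (p1 * eta_cv * t1 + p0 * t0) / (p1 * eta_cv + p0)
    f2 = 2 * (1 + tsnr) / (eta_cv * tsnr)                     # expression (25)
    f3 = p0 * (t1 - t0) / (p1 * (t1 - eta_net) ** 2)          # expression (16)
    return pdf * f2 * f3

eta_s, n = 31.83, 8   # eta_s roughly matching PFA = 1e-4 for n = 8 (assumed)
for tsnr in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    print(tsnr, sensitivity(eta_s, tsnr, n))
```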
6 Conclusions

In this paper, the application of neural networks to approximating the Neyman-Pearson detector is considered. We propose the calculation of the partial derivative of the probability of false alarm with respect to the detection threshold as a tool to identify the training parameters that can be controlled to reduce the effect of approximation errors on the performance of the neural network based detector.

As a first step, the function approximated by a neural network trained using the mean squared-error criterion is deduced. Without imposing any restriction on the prior probabilities of the classes or on the desired outputs selected for training, we obtain a general expression that reveals that these parameters play an important role in controlling the sensitivity of the neural network detector performance to approximation errors.

In previous works, the signal-to-noise ratio selected for training appeared as a critical design parameter, but no effort had been made to explain the dependence of the neural network based detector on this parameter. In this paper, we explain this dependence and provide a strategy to determine the best $tsnr$ value when the statistical properties of the feature vectors are known.
References

1. Van Trees, H.L.: Detection, Estimation, and Modulation Theory, Vol. 1. Wiley (1968)
2. Ruck, D.W., Rogers, S.K., Kabrisky, M., Oxley, M.E., Suter, B.W.: The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks, vol. 1, no. 4, pp. 296-298, December 1990
3. Wan, E.A.: Neural network classification: a Bayesian interpretation. IEEE Transactions on Neural Networks, vol. 1, no. 4, pp. 303-305, December 1990
4. Gandhi, P.P., Ramamurti, V.: Neural networks for signal detection in non-Gaussian noise. IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2846-2851, November 1997
5. Andina, D., Sanz-González, J.L.: Comparison of a neural network detector vs. Neyman-Pearson optimal detector. Proceedings of the 1996 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 6, pp. 3573-3576, 1996
6. Jarabo-Amores, P., Rosa-Zurera, M., López-Ferreras, F.: Design of a Pre-processing Stage for Avoiding the Dependence on TSNR of a Neural Radar Detector. Lecture Notes in Computer Science, Vol. 2085. Springer-Verlag, Berlin Heidelberg New York (2001) 652-659