for the dichotomic data classification task, where z = (x,t) are data pairs and F(z) is the
data distribution. R(w) is then the NN probability of error, P_e(w).
In practice, the NN is designed so as to minimize the empirical error, i.e., based on a training set D_n = {z_i = (x_i, t_i): x_i ∈ X, t_i ∈ T, i = 1, 2, …, n}, we pick the mapping (by suitable NN weight adjustment) that minimizes

\[
R_{\mathrm{emp}}(w) = \frac{1}{n} \sum_{i=1}^{n} Q(z_i, w) .
\]
The minimum empirical risk occurs for a weight vector w_n. R_emp(w_n) is then the resubstitution estimate of the probability of error, P_e, of the NN classifier.
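For concreteness, the resubstitution estimate is just the misclassification rate measured on the training set itself. The following minimal sketch in Python illustrates this; the function names empirical_risk and perceptron_predict, the 0/1 loss used for Q, and the tiny data set are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def empirical_risk(w, X, t, predict):
    """Resubstitution estimate R_emp(w): fraction of training pairs
    z_i = (x_i, t_i) misclassified by the mapping selected by w."""
    predictions = predict(X, w)        # NN outputs mapped to class labels
    return np.mean(predictions != t)   # average 0/1 loss Q(z_i, w) over D_n

def perceptron_predict(X, w):
    """Illustrative perceptron-style decision rule with a step activation."""
    return (X @ w[1:] + w[0] >= 0).astype(int)

# Tiny training set D_n with n = 4 (hypothetical data)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([0, 0, 0, 1])
w = np.array([-1.5, 1.0, 1.0])         # bias, w1, w2
print(empirical_risk(w, X, t, perceptron_predict))  # 0.0: w separates D_n
```

Here R_emp(w) = 0 because the chosen weights happen to separate D_n; on unseen data drawn from F(z) the true error P_e will in general be larger, which is why the resubstitution estimate tends to be optimistic.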
Statistical Learning Theory [9] shows that the necessary and sufficient condition for the consistency of the ERM principle, independently of the probability distribution (independently of the problem to be solved), is:

\[
\lim_{n \to \infty} \frac{G(n)}{n} = 0 ,
\]
where G(n), called the growth function, can be shown either to satisfy

\[
G(n) = n \ln 2 \quad \text{if } n \le h ,
\]

or to be bounded by

\[
G(n) \le \ln \sum_{i=0}^{h} \binom{n}{i} \le h \left( \ln \frac{n}{h} + 1 \right) \quad \text{if } n > h .
\]
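Since the bound grows only logarithmically in n for any finite h, the condition G(n)/n → 0 can be checked numerically. A small sketch follows; the function name growth_bound and the choice h = 3 are assumptions made for illustration.

```python
import math

def growth_bound(n, h):
    """Upper bound on the growth function G(n) for a family
    with finite VC-dimension h (illustrative helper)."""
    if n <= h:
        return n * math.log(2)                   # G(n) = n ln 2
    log_shatter = math.log(sum(math.comb(n, i) for i in range(h + 1)))
    vapnik = h * (math.log(n / h) + 1)           # h (ln(n/h) + 1)
    return min(log_shatter, vapnik)

h = 3  # e.g. a perceptron in the plane (see below)
for n in (10, 100, 10_000, 1_000_000):
    print(n, growth_bound(n, h) / n)             # ratio shrinks toward 0
```

The printed ratios decrease toward zero, in line with the consistency condition: for any family with finite VC-dimension h, the bound h(ln(n/h) + 1) grows slower than n.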
The parameter h, called the Vapnik-Chervonenkis dimension (or VC-dimension for short), is the maximum number of vectors z_1, …, z_h which can be separated in all 2^h possible ways using the family of functions φ(z) implemented by the neural network (i.e., shattered by that family). As an example, consider a single perceptron with x ∈ ℜ² that uses a step function as activation function. The perceptron implements a line in 2-dimensional space. Therefore, determining h amounts to determining the maximum cardinality of a set of points that can be separated in all 2^h possible ways by a line. In this simple case h = 3. For more complex situations there are no exact formulas for h and one has to be content with bounding formulas.
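The claim h = 3 for the two-dimensional perceptron can be checked by brute force: enumerate every dichotomy of a candidate point set and test each one for linear separability. The sketch below does this with a linear-programming feasibility test via scipy.optimize.linprog; the function names separable and shattered, the particular point sets, and the use of SciPy are assumptions made for illustration.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Can a line (step-activation perceptron) realize this 0/1 labeling?
    Feasibility LP: find (w, b) with y_i (w . x_i + b) >= 1 for all i."""
    y = np.where(np.asarray(labels) == 1, 1.0, -1.0)
    A_ub = -y[:, None] * np.hstack([points, np.ones((len(points), 1))])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0          # 0 = feasible, 2 = infeasible

def shattered(points):
    """True if all 2^n dichotomies of the points are realizable."""
    return all(separable(points, labels)
               for labels in itertools.product([0, 1], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shattered(three))  # True: these 3 points are shattered, so h >= 3
print(shattered(four))   # False: the XOR dichotomy cannot be realized by a line
```

This run only establishes h ≥ 3 and exhibits one 4-point configuration that is not shattered; the general geometric argument that no set of four points in the plane can be shattered by a line is what gives h = 3, as stated above.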
Let C be a class of possible dichotomies of the data inputs. The neural network attempts to learn a dichotomy c ∈ C using a function from a family of functions Φ that it implements. Using an appropriate algorithm, the neural network learns c ∈ C with an error probability (true error rate) ε (≡ P_e) with confidence δ. A theorem presented in [3] establishes the PAC-learnability¹ of the class C only if its VC
¹ C is PAC-learnable by an algorithm using D_n and Φ if, for any data distribution and ∀ ε, δ, 0 < ε, δ < 0.5, the algorithm is able to determine a function φ ∈ Φ such that P(P_e(φ) ≤ ε) ≥ 1 − δ, in polynomial time in 1/ε, 1/δ, n and size(c). (See [1], [5], [7].)