AHP-Based Classifier Combination

László Felföldi and András Kocsor

Department of Informatics, University of Szeged, H-6720 Szeged, Árpád tér 2., Hungary

Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, H-6720 Szeged, Aradi vértanúk tere 1., Hungary

Abstract. Classiﬁer combinations are effective techniques for difﬁcult pattern

recognition problems such as speech recognition where the combination of dif-

ferently trained classiﬁers can produce a more robust phoneme classiﬁcation on

noisy datasets. In this paper we investigate traditional linear combination schemes

(e.g. arithmetic mean and least squares methods), and propose a new combiner

based on the Analytic Hierarchy Process (AHP), a method frequently applied

in mathematical psychology and multi-criteria decision making. In addition, we

experimentally compare the applicability of these linear combination schemes

using neural network classiﬁers on a speech recognition framework and two test

sets from the UCI repository.

1 Introduction

In pattern recognition problems [1][2][3] the main aim is to construct a classiﬁer (or

inducer) in order to model the behavior of a system. For each pattern of a pattern space

the classiﬁer has to select a class label from the set of available labels. The construction

of a classiﬁer (i.e. the learning process) is based on a set of labelled examples, and

strongly depends on prior knowledge about the pattern space and the characteristics

of the given examples. Given inﬁnite training data, consistent classiﬁers approximate

the Bayesian decision boundaries to arbitrary precision, therefore providing a similar

generalization. However, often only a limited portion of the pattern space is available

or observable. Given a ﬁnite and noisy data set, different classiﬁers typically provide

different generalizations. It is thus necessary to train more networks when dealing with

classiﬁcation problems to ensure that a good model or parameter set is found. However,

selecting such a classiﬁer is not necessarily the ideal choice since potentially valuable

information may be wasted by discarding the results of the other classiﬁers. In order

to avoid this kind of loss of information, the output of all available classiﬁers should

be examined for the ﬁnal decision. This approach is particularly useful for difﬁcult

problems such as those that involve a large amount of noise.

A fair number of combination methods have been proposed in the literature [4]

which have proved effective in improving classiﬁer performance. In this paper we focus

on linear combination schemes, especially averaging techniques. Experimental studies

have shown that linear classiﬁer combinations can improve the recognition accuracy.

Felföld L. and Kocsor A. (2004).

AHP-Based Classiﬁer Combination.

In Proceedings of the 4th International Workshop on Pattern Recognition in Information Systems, pages 45-58

DOI: 10.5220/0002680200450058

 SciTePress

Tumer and Ghosh showed that combining networks using simple averaging reduces the

variance of the actual decision boundaries around the optimum boundary [5]. Later

Fumera and Roli extended the theoretical framework for the weighted averaging rule

[6]. However, these theoretical explanations rely on very restrictive assumptions, hence they cannot show the real performance of the combiner. To improve the accuracy of the averaging techniques, we propose a new combination method based on the Analytic Hierarchy Process, and demonstrate that it may produce a more robust classification performance in tests.

The paper is organized as follows. In the next section we give a brief overview of

linear combination schemes, including averaging techniques, and examine the theoretical background of these systems. The third section introduces the Analytic Hierarchy Process and shows how it can be used in combination frameworks. The experimental section then compares the performance of various linear schemes on different databases and classifier sets. Finally, we give some brief conclusions and ideas for future research.

2 Linear Classiﬁer Combinations

Fig.1. General parallel combination scheme

Combiners aggregate the outputs of different classiﬁers to make a ﬁnal decision.

This aggregation depends on the kind of information that the individual classiﬁers can

supply. From this viewpoint the classiﬁcation methods can be grouped into three main

categories:

1. Measurement or Confidence. The classifier can model probability values for each class label. Let $f_i^j(x)$ denote the output of classifier $C_j$ for class $i$ and pattern $x$. The linear combination method can be described by the formula

$$\hat{f}_i(x) = \sum_{j=1}^{N} w_j f_i^j(x), \tag{1}$$

where $\hat{f}_i(x)$ is the combined class conditional probability, and $w_j$ is the weighting factor of classifier $C_j$. The final decision can be obtained by selecting the class with the greatest probability, in accordance with the Bayesian decision principle.

2. Ranking. The classifier can produce a list of class labels in the order of their probabilities. The combined position $\hat{g}_i$ is then computed via the formula

$$\hat{g}_i(x) = \sum_{j=1}^{N} w_j g_i^j(x), \tag{2}$$

where $g_i^j(x)$ is the position of the label $i$ in the classification result of pattern $x$ obtained by the classifier $C_j$. With the selection of a proper monotonic decreasing function $u$, and

$$f_i^j(x) = u(g_i^j(x)), \tag{3}$$

the ranking-type combination can be reduced to the confidence-type scheme.

3. Abstract. The classifier supplies only the most probable class label. In this case the combination relies on the majority voting formula

$$\hat{L}(x) = \arg\max_i \sum_{j:\,L^j(x)=i} w_j, \tag{4}$$

where $L^j(x)$ is the index of the class label calculated for pattern $x$ by classifier $C_j$. Taking the selection of $f_i^j(x)$ as

$$f_i^j(x) = \begin{cases} 1 & \text{if } L^j(x) = i \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

leads to the reformulation of voting as a confidence-type combination much like that for the ranking type.

As we have shown, the output of classiﬁers belonging to the Ranking or Abstract class

can be transformed to class conditional probability values. Therefore, in the follow-

ing section we shall deal only with the conﬁdence-type combination, and expect the

combiners to supply the class conditional probabilities.
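The reduction of the three output types to a single confidence-type combiner can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names, the choice $u(g) = 1/g$, the weights and the three example outputs are hypothetical and do not come from the experiments of this paper.

```python
from __future__ import annotations

from typing import Callable, Sequence


def combine_confidences(outputs: Sequence[Sequence[float]],
                        weights: Sequence[float]) -> list[float]:
    """Eq. (1): weighted sum of the per-classifier class scores.

    outputs[j][i] is the score of classifier j for class i.
    """
    n_classes = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(n_classes)]


def ranks_to_confidences(ranks: Sequence[int],
                         u: Callable[[int], float]) -> list[float]:
    """Eq. (3): map rank positions (1 = best) to scores with a decreasing u."""
    return [u(g) for g in ranks]


def label_to_confidences(label: int, n_classes: int) -> list[float]:
    """Eq. (5): one-hot encoding of an abstract-level output."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def decide(combined: Sequence[float]) -> int:
    """Select the class with the greatest combined score."""
    return max(range(len(combined)), key=lambda i: combined[i])


# Three hypothetical classifiers over three classes, one per output type.
conf = [0.2, 0.5, 0.3]                                       # confidence type
rank = ranks_to_confidences([3, 1, 2], u=lambda g: 1.0 / g)  # ranking type
vote = label_to_confidences(2, n_classes=3)                  # abstract type
weights = [0.5, 0.25, 0.25]                                  # illustrative only

combined = combine_confidences([conf, rank, vote], weights)
winner = decide(combined)
```

With these illustrative numbers the abstract-level vote tips the decision towards class 2, although the confidence-type classifier alone would have chosen class 1.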

We commence our study of linear combinations with a theoretical investigation, and

then show how to apply them in practice.

2.1 Theoretical investigation

As mentioned previously, the outputs of the classifiers are expected to approximate the

corresponding a posteriori probabilities if they are reasonably well trained. Thus the

decision boundaries obtained using this kind of classiﬁer are close to Bayesian decision

boundaries. Tumer and Ghosh developed a theoretical framework for analyzing the av-

eraging rule of linear combinations [5]. Later Roli and Fumera extended this concept

by examining the weighted averaging rule [6].

We shall focus on the classification performance near these decision boundaries. Consider the boundary between class $i_1$ and $i_2$. The output of the classifier is

$$f_i(x) = p_i(x) + \varepsilon_i(x), \tag{6}$$

where $p_i(x)$ is the real a posteriori probability of class $i$, and $\varepsilon_i(x)$ is the estimation error. Let us assume that the class boundary $x_b$ obtained from the approximated a posteriori probabilities,

$$f_{i_1}(x_b) = f_{i_2}(x_b), \tag{7}$$

is close to the Bayesian boundary $x^*$, i.e.

$$p_{i_1}(x^*) = p_{i_2}(x^*), \tag{8}$$

for the boundary between class $i_1$ and $i_2$. Additionally assuming that the estimation errors $\varepsilon_i(x)$ on different classes are independently and identically distributed (i.i.d.) variables with zero mean, Tumer and Ghosh showed that the expected error of classification can be expressed as:

$$E = E_{Bayes} + E_{add}, \tag{9}$$

where $E_{Bayes}$ is the error of a classifier with the correct Bayesian boundary. The added error $E_{add}$ can be expressed as:

$$E_{add} = \frac{s\,\sigma_b^2}{2}, \tag{10}$$

where $\sigma_b^2$ is the variance of $b$,

$$b = \frac{\varepsilon_{i_1}(x_b) - \varepsilon_{i_2}(x_b)}{s}, \tag{11}$$

and $s$ is a constant term depending on the derivatives of the probability density functions at the optimal decision boundary.

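Let us consider the effect of combining multiple classifiers. We shall deal only with linear combinations, so we have the following combined probabilities:

$$\hat{f}_i(x) = \sum_{j=1}^{N} w_j f_i^j(x), \tag{12}$$

where $f_i^j$ denotes the output of the classifier $C_j$ for the class $i$. Assuming normalized weights, i.e.

$$\sum_{j=1}^{N} w_j = 1, \quad w_j \ge 0, \tag{13}$$

we have that

$$\hat{f}_i(x) = p_i(x) + \hat{\varepsilon}_i(x), \tag{14}$$

where

$$\hat{\varepsilon}_i(x) = \sum_{j=1}^{N} w_j \varepsilon_i^j(x). \tag{15}$$

Let us compute the variance of $\hat{b}$, where

$$\hat{b} = \frac{\hat{\varepsilon}_{i_1}(x_b) - \hat{\varepsilon}_{i_2}(x_b)}{s}. \tag{16}$$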

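Assume that the errors $\varepsilon_i^j(x)$ of a classifier $C_j$ are i.i.d. variables with zero mean and variance $\sigma_j^2$, and that the errors of different classifiers on the same class are correlated, while on different classes they are uncorrelated:

$$\mathrm{Cov}\big(\varepsilon_{i_1}^m(x_b), \varepsilon_{i_2}^n(x_b)\big) = \begin{cases} \rho^{mn} & \text{when } i_1 = i_2 \\ 0 & \text{otherwise,} \end{cases}$$

where $\rho^{mn}$ denotes the covariance between the errors of classifier $C_m$ and $C_n$ for each class. Expanding the term $\sigma_b^2$ in Eq. (10) for the combined error $\hat{b}$, we find that the added error of the combiner is

$$\hat{E}_{add} = \frac{1}{s} \sum_{j=1}^{N} w_j^2 \sigma_j^2 + \frac{1}{s} \sum_{m \ne n} w_m w_n \rho^{mn}.$$

Expressed in terms of the added errors of the classifiers (for a single classifier, Eqs. (10) and (11) give $E_{add}^j = \sigma_j^2 / s$),

$$\hat{E}_{add} = \sum_{j=1}^{N} w_j^2 E_{add}^j + \frac{1}{s} \sum_{m \ne n} w_m w_n \rho^{mn},$$

where the term $E_{add}^j$ denotes the added error of the classifier $C_j$. In the case of uncorrelated estimation errors (i.e. when $\rho^{mn} = 0$), this equation reduces to

$$\hat{E}_{add} = \sum_{j=1}^{N} w_j^2 E_{add}^j, \tag{17}$$

which leads to the following optimal values for $w$:

$$w_j = \frac{c}{E_{add}^j}, \tag{18}$$

where $c$ is a normalization factor. With equally performing classifiers, that is when all the errors $E_{add}^j$ have the same value

$$E_{add}^j = E_{add}, \quad j = 1, \ldots, N, \tag{19}$$

we obtain the Simple Averaging rule:

$$w_j = \frac{1}{N}, \tag{20}$$

which results in the following error value:

$$\hat{E}_{add} = \frac{E_{add}}{N}. \tag{21}$$

This formula shows that, under the conditions mentioned, linear combinations reduce the error of the individual classifier. Considering correlated errors, however, does not lead to a simple general expression for optimal values of weights, and other methods are required to estimate the optimal parameters.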
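The uncorrelated case of Eqs. (17) to (21) can be checked numerically with a few lines of code. This is a minimal sketch: the added-error values below are hypothetical and serve only to show that weights proportional to $1/E_{add}^j$ give a smaller combined added error than equal weights when the classifiers perform unequally.

```python
def combined_added_error(weights, added_errors):
    """Eq. (17): added error of the combiner for uncorrelated estimation errors."""
    return sum(w * w * e for w, e in zip(weights, added_errors))


def optimal_weights(added_errors):
    """Eq. (18): weights proportional to 1 / E_add^j, normalized to sum to 1."""
    inverse = [1.0 / e for e in added_errors]
    c = 1.0 / sum(inverse)  # normalization factor
    return [c * v for v in inverse]


# Hypothetical added errors of three unequally performing classifiers.
errors = [0.04, 0.08, 0.16]
w_opt = optimal_weights(errors)
w_avg = [1.0 / len(errors)] * len(errors)

e_opt = combined_added_error(w_opt, errors)  # optimal weighting
e_avg = combined_added_error(w_avg, errors)  # simple averaging

# Equal performers: simple averaging divides the added error by N, Eq. (21).
e_equal = combined_added_error([0.25] * 4, [0.08] * 4)
```

For the optimal weights the combined added error equals the normalization factor $c$ itself, since $\sum_j (c / E_{add}^j)^2 E_{add}^j = c^2 \sum_j 1 / E_{add}^j = c$.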

2.2 Construction

To achieve the best combination performance the parameters of the combiner can be

trained on a selected training data set. The form of linear combinations we deal with is

quite simple (see Figure 1), the trainable parameters being just the weights of classiﬁers.

Thus the various linear combinations differ only in the values of these factors. In the

following we give some examples of methods commonly employed, and in the next

section we propose a new method for computing these parameters.

1. Simple Averaging. In this simplest case, the weights are all given the same value:

$$w_j = \frac{1}{N}. \tag{22}$$

As mentioned before, this selection is optimal when all the classifiers have a very similar performance, and all the assumptions mentioned in Section 2.1 hold.

2. Weighted Averaging. Experiments show [6] that Simple Averaging can be outperformed by selecting weights that are inversely proportional to the error rate of the corresponding classifier:

$$w_j = \frac{1/E_j}{\sum_{k=1}^{N} 1/E_k}, \tag{23}$$

where $E_j$ is the error rate of the classifier $C_j$, i.e. the ratio of the number of incorrectly classified patterns to the total number of patterns on a data set selected for training the combiner. This rule (a more general version of the Simple Averaging rule) is employed in order to handle the combination of unequally performing classifiers.

3. Least-squares methods. To calculate the weights $w_j$ one can take those values that minimize the distance between the computed and estimated class conditional probabilities:

$$\min_{w} \sum_{x \in X} \sum_{i=1}^{K} \Big( \sum_{j=1}^{N} w_j f_i^j(x) - p_i(x) \Big)^2, \tag{24}$$

where $K$ is the number of classes,

$$X = \{x_1, \ldots, x_m\}$$

is the training set of patterns for the combination, and $p_i(x)$ is the estimated class conditional probability function. In the case of supervised learning the class labels are available for all the training patterns, but there is no direct way of comparing this information with computed a posteriori probabilities. Assuming a Bayesian decision, the pattern $x$ has the label $L(x) = i$ when $p_i(x) \ge p_j(x)$ for all $j \ne i$. Let us approximate $p_i(x)$ by

$$p_i(x) = \begin{cases} 1 & \text{if } L(x) = i \\ 0 & \text{otherwise,} \end{cases} \tag{25}$$

or calculate it from the error correlation matrix of the n combiners [7]. Using matrix notations the problem described in Eq. (24) can be expressed as:

$$\min_{w} (Aw - b)^T (Aw - b), \tag{26}$$

where

$$A_{(mi+k),\,j} = f_i^j(x_k), \qquad b_{mi+k} = p_i(x_k).$$

This optimization problem leads to the following linear equation:

$$A^T A w = A^T b, \tag{27}$$

which provides a way of determining the weighting factors.
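The three ways of obtaining the weights can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: the helper names and the toy outputs are hypothetical, the targets follow the 0/1 approximation of Eq. (25), and the normal equations of Eq. (27) are solved by plain Gaussian elimination without enforcing the normalization of Eq. (13).

```python
def simple_averaging_weights(n):
    """Eq. (22): equal weights."""
    return [1.0 / n] * n


def weighted_averaging_weights(error_rates):
    """Eq. (23): weights inversely proportional to the error rates."""
    inverse = [1.0 / e for e in error_rates]
    total = sum(inverse)
    return [v / total for v in inverse]


def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[r]] for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def least_squares_weights(outputs, labels, n_classes):
    """Eqs. (24)-(27); outputs[j][k][i] is the score of classifier j
    for pattern k and class i."""
    n = len(outputs)
    rows, targets = [], []
    for k, label in enumerate(labels):
        for i in range(n_classes):
            rows.append([outputs[j][k][i] for j in range(n)])
            targets.append(1.0 if label == i else 0.0)  # Eq. (25)
    ata = [[sum(r[a] * r[b] for r in rows) for b in range(n)] for a in range(n)]
    atb = [sum(r[a] * t for r, t in zip(rows, targets)) for a in range(n)]
    return solve_linear(ata, atb)  # Eq. (27)


# Toy data: classifier 0 reproduces the targets exactly, classifier 1 is noisy.
labels = [0, 1, 0]
outputs = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]],
]
w_sa = simple_averaging_weights(2)
w_wa = weighted_averaging_weights([0.10, 0.20])
w_ls = least_squares_weights(outputs, labels, n_classes=2)
```

In this toy case the least-squares solution puts all the weight on the classifier that matches the targets, whereas weighted averaging only shifts the weights in proportion to the (hypothetical) error rates.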

3 The Analytic Hierarchy Process

The Analytic Hierarchy Process (AHP) is an intuitive and efficient method for multi-criteria decision-making (MCDM) applications [8]. The structure of a typical decision problem (see Fig. 2) consists of alternatives and decision criteria. Each alternative can be evaluated in terms of the decision criteria and, in the case of multilevel criteria, the relative importance (or weight) of each criterion can also be estimated.

Fig.2. A simple application of AHP

In the following we give a brief summary of the fundamental properties of the AHP,

and propose a combination method based on it.

3.1 Mathematical model

The ﬁrst step of AHP is to divide the decision problem into sub-problems, which are

structured into hierarchy levels. The number of levels depends on the complexity of the

initial problem. The leaves contain the possible alternatives and the inner nodes repre-

sent the criteria. To compute the importance of possible choices, pairwise comparison

matrices are utilized for each criterion. The element $a_{ij}$ of the comparison matrix $A$ represents the relative importance of choice $i$ against the choice $j$, implying that the element $a_{ji}$ is the reciprocal of $a_{ij}$. Let the importance value $v$ of choice $y$ be expressed as a linear combination of the importance values for each applied criterion:

$$v(y) = \sum_{j=1}^{n} w_j v(y_j), \tag{28}$$

where $w_j$ is the importance of choice $y$ with respect to the criterion $y_j$. Using comparison matrices AHP propagates the importance values of each node from the topmost criteria towards the alternatives, and selects the alternative with the greatest importance value as its final decision.

Let us now focus on the computation of the weights $w$ for a selected criterion. The elements of a given pairwise comparison matrix approximate the relative importance of the choices, thus

$$a_{ij} \approx \frac{w_i}{w_j}, \tag{29}$$

where the elements of the unknown vector $w$ are the importance values. A matrix $M$ is called consistent if its components satisfy the following equalities:

$$m_{ij} = \frac{1}{m_{ji}}, \tag{30}$$

and

$$m_{ij} m_{jk} = m_{ik} \quad \forall\, i, j, k. \tag{31}$$

If $A$ is not consistent, it is not possible to find a vector $w$ that satisfies the equation

$$a_{ij} = \frac{w_i}{w_j}. \tag{32}$$

Now let us define the matrix of weight ratios by

$$W = \begin{pmatrix} w_1/w_1 & \cdots & w_1/w_n \\ \vdots & \ddots & \vdots \\ w_n/w_1 & \cdots & w_n/w_n \end{pmatrix}, \tag{33}$$

or, in matrix notation,

$$W = w \, (w^{-1})^T, \tag{34}$$

where $w^{-1}$ denotes the vector of element-wise reciprocals of $w$. Note that Eqs. (30) and (31) hold for the matrix $W$:

$$W_{ij} W_{jk} = \frac{w_i}{w_j} \cdot \frac{w_j}{w_k} = \frac{w_i}{w_k} = W_{ik}, \tag{35}$$

hence the matrix of weight ratios is consistent.

Because the rows of matrix $W$ are linearly dependent, the rank of the matrix is 1, and there is only one nonzero eigenvalue. Since the trace of a matrix is invariant under similarity transformations, the sum of the diagonal elements is equal to the sum of the eigenvalues, which implies that the nonzero eigenvalue $\lambda_{max}$ equals the number of rows:

$$\lambda_{max} = n. \tag{36}$$

It is straightforward to check that the vector $w$ is an eigenvector of matrix $W$ corresponding to the maximum eigenvalue:

$$(Ww)_i = \sum_{j=1}^{n} W_{ij} w_j = \sum_{j=1}^{n} \frac{w_i}{w_j} w_j = n w_i.$$

The aim of AHP is to recover the weight vector $w$ from a pairwise comparison matrix $A$, where the elements of $A$ correspond to the measured or estimated weight ratios. Following Saaty we shall assume that

$$a_{ij} > 0, \tag{37}$$

and

$$a_{ij} = \frac{1}{a_{ji}}. \tag{38}$$

From matrix theory it is known that a small perturbation of the coefficients implies a small perturbation of the eigenvalues. Hence we still expect to find an eigenvalue close to $n$, and select the elements of the corresponding eigenvector as weights. It can be proved that

$$\lambda_{max} \ge n,$$

and the matrix $A$ is consistent if and only if $\lambda_{max} = n$. A way of measuring the consistency of the matrix $A$ is by defining the consistency index (CI) as the negative average of the remaining eigenvalues:

$$CI = -\frac{\sum_{\lambda < \lambda_{max}} \lambda}{n - 1} = \frac{\lambda_{max} - n}{n - 1}. \tag{39}$$
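The eigenvector computation and the consistency index of Eq. (39) can be sketched with a simple power iteration. This is a minimal sketch, assuming a positive reciprocal matrix; the function names and the two example matrices are hypothetical, and any eigenvalue routine could be used in place of the power iteration.

```python
def principal_eigen(matrix, iterations=200):
    """Power iteration for a positive matrix.

    Returns (lambda_max, w) with w normalized so that its elements sum to 1.
    """
    n = len(matrix)
    w = [1.0 / n] * n
    lam = float(n)
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        lam = sum(v)  # because sum(w) == 1, sum(A w) estimates lambda_max
        w = [x / lam for x in v]
    return lam, w


def consistency_index(lam_max, n):
    """Eq. (39): CI = (lambda_max - n) / (n - 1)."""
    return (lam_max - n) / (n - 1)


def ratio_matrix(w):
    """Eq. (33): the consistent matrix of weight ratios w_i / w_j."""
    return [[wi / wj for wj in w] for wi in w]


# A consistent matrix built from known weights: lambda_max = n and CI = 0.
consistent = ratio_matrix([0.6, 0.3, 0.1])
lam_c, w_c = principal_eigen(consistent)
ci_c = consistency_index(lam_c, 3)

# A reciprocal but inconsistent matrix (3 * 3 != 5): lambda_max > n and CI > 0.
perturbed = [[1.0, 3.0, 5.0],
             [1.0 / 3.0, 1.0, 3.0],
             [1.0 / 5.0, 1.0 / 3.0, 1.0]]
lam_p, w_p = principal_eigen(perturbed)
ci_p = consistency_index(lam_p, 3)
```

For the consistent matrix the iteration recovers the generating weights, while for the perturbed matrix the eigenvector still yields a usable ranking of the choices and CI quantifies how far the judgments are from consistency.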

3.2 Combination using AHP

As mentioned above, AHP provides the following solution for the problem of linear MCDM systems:

$$v(y) = \sum_{i=1}^{n} w_i v(y_i),$$

where the importance value of the choice is the linear combination of the importance values of the direct criteria. In linear classifier combinations the combined class conditional probabilities are computed as weighted sums of the probability values from each classifier, so

$$\hat{f}_i(x) = \sum_{j=1}^{N} w_j f_i^j(x).$$

Noting the similarities between these two methods, it is clear that, by applying pairwise comparisons on the classifiers' performance, AHP provides a way of computing the weights of inducers in classifier combinations. Let us calculate the element $a_{ij}$ of the comparison matrix as the quotient of classification performance on a selected test data set:

$$a_{ij} = \frac{E_j}{E_i}, \tag{40}$$

where $E_j$ is the classification error of classifier $C_j$. If all the errors are measured on the same test data set, the comparison matrix $A$ is consistent, and the elements of the eigenvector whose corresponding eigenvalue is $N$, that is

$$w_j = \frac{1/E_j}{\sum_{k=1}^{N} 1/E_k}, \tag{41}$$

are the same as those generated by weighted averaging. However, this method allows us to make pairwise comparisons of different inducers applied on different (e.g. randomly generated) test sets, taking advantage of the stabilizing effect of AHP. This leads to a more robust classification performance, especially in noisy environments, as shown in the experimental section.
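The equivalence with weighted averaging in the consistent case can be verified directly. The sketch below assumes that the comparison matrix is built as $a_{ij} = E_j / E_i$, so that a smaller error means a greater importance; the error rates are hypothetical.

```python
def comparison_matrix(errors):
    """Comparison matrix with a_ij = E_j / E_i, all errors from one test set."""
    return [[e_j / e_i for e_j in errors] for e_i in errors]


def weighted_averaging_weights(errors):
    """Weights inversely proportional to the errors, normalized to sum to 1."""
    inverse = [1.0 / e for e in errors]
    total = sum(inverse)
    return [v / total for v in inverse]


errors = [0.10, 0.20, 0.40]  # hypothetical error rates of three classifiers
A = comparison_matrix(errors)
w = weighted_averaging_weights(errors)

# Multiply A by the weighted averaging weights: the result should be N * w,
# i.e. w is the eigenvector that belongs to the eigenvalue N.
Aw = [sum(a * x for a, x in zip(row, w)) for row in A]
n = len(errors)
is_eigenvector = all(abs(Aw[i] - n * w[i]) < 1e-12 for i in range(n))

# Consistency check: a_ij * a_jk == a_ik for every i, j, k.
is_consistent = all(abs(A[i][j] * A[j][k] - A[i][k]) < 1e-12
                    for i in range(n) for j in range(n) for k in range(n))
```

The AHP weights differ from those of weighted averaging only when the pairwise quotients come from different test sets, because the comparison matrix is then no longer consistent.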

4 Experiments

In this section we describe our experiments for comparing the performance of averaging

combiners and our AHP-based combiner.

4.1 Evaluation Domain

In the experiments three data sets were employed: a data set used in our speech recognition system, and two other data sets (letter and satimage) originating from the Statlog/UCI repository (http://www.liacc.up.pt/ML/statlog).

1. Speech data set [9]. The database is based on recorded samples taken from 160

children aged between 6 and 8. The ratio of girls and boys was 50% - 50%. The

speech signals were recorded and stored at a sampling rate of 22050 Hz in 16-bit

quality. Each speaker uttered all the Hungarian vowels, one after the other, sepa-

rated by a short pause. Since we decided not to discriminate their long and short

versions, we only worked with 9 vowels altogether. The recordings were divided

into a train and a test set in a ratio of 50% - 50%. There are numerous methods for

obtaining representative feature vectors from speech data, but their common prop-

erty is that they are all extracted from 20-30 ms chunks or frames of the signal in

5-10 ms time steps. The simplest possible feature set consists of the so-called bark-

scaled ﬁlterbank log-energies (FBLE). This means that the signal is decomposed

with a special ﬁlterbank and the energies in these ﬁlters are used to parameterize

speech on a frame-by-frame basis. In our tests the ﬁlters were approximated via

Fourier analysis with a triangular weighting. Altogether 24 ﬁlters were necessary

to cover the frequency range from 0 to 11025 Hz. Although the resulting log-energy

values are usually sent through a cosine transformation to obtain the well-known

mel-frequency cepstral coefﬁcients (MFCC), we abandoned it because, as we ob-

served earlier, the learner we work with is not sensitive to feature correlations so a

cosine transformation would bring no signiﬁcant improvement.

2. Letter Data Set. The objective of the data set is to identify each of a large number

of black-and-white rectangular pixel displays as one of the 26 capital letters in

the English alphabet. The character images were based on 20 different fonts and

each letter within these 20 fonts was randomly distorted to produce a ﬁle of 20,000

unique stimuli. Each stimulus was converted into 16 primitive numerical attributes

(statistical moments and edge counts) which were then scaled to ﬁt into a range of

integer values from 0 to 15. We typically trained on the ﬁrst 16000 items and then

used the resulting model to predict the letter category for the remaining 4000.

3. Satimage Data Set. One frame of Landsat MSS imagery consists of four digital

images of the same scene in different spectral bands. Two of these are in the visible

region (corresponding approximately to green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is an 8-bit binary word, with 0

corresponding to black and 255 to white. The spatial resolution of a pixel is about

80m x 80m. Each image contains 2340 x 3380 such pixels. The database is a (tiny)

sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds

to a 3x3 square neighborhood of pixels completely contained within the 82x100

sub-area. Each line contains the pixel values in the four spectral bands (converted

to ASCII) of each of the 9 pixels in the 3x3 neighborhood and a number indicating

the classiﬁcation label of the central pixel. The number of possible class labels for

each pixel is 7. We trained on the ﬁrst 4435 patterns of the database and selected

the remaining 2000 patterns for testing.

4.2 Evaluation Method

During the experiments we compared the performance of 6 different combiners applied

on each of the 3 databases. For each of the databases we trained 3-layered neural net-

works with different structures, and selected 5 subsets of classiﬁers, denoted by Set1

to Set5.

In the case of the speech and letter databases we trained networks setting the number

of neurons in the hidden layer to 5, 10, 20, 40, 80, 160, and 320. Table 1 shows the

construction of classiﬁer sets. The columns refer to the number of hidden neurons, and

the rows show which networks belong to the selected classiﬁer sets.

5 10 20 40 80 160 320

Set1 x x x x x x x

Set2 x x x x x x

Set3 x x x x x

Set4 x x x x

Set5 x x x

Table 1. Classiﬁer sets for the speech and letter databases.

The satimage database contains only 7 classes. In this case we trained networks with

hidden layer size selected to 2, 5, 10, 20, 40, 80, and 160. The corresponding classiﬁer

selection is displayed in Table 2.

2 5 10 20 40 80 160

Set1 x x x x x x x

Set2 x x x x x x

Set3 x x x x x

Set4 x x x x

Set5 x x x

Table 2. Classiﬁer sets for the satimage database.

The experiments compared the performance of linear combination schemes with different methods for acquiring the proper weights. We examined 2 averaging schemes, SA and WA (i.e. simple and weighted averaging), and 4 AHP schemes. To calculate the pairwise comparison matrices needed for the AHP method, we took the quotient of the classification errors of the two competing networks on a random test set generated by bootstrapping (i.e. resampling with replacement) the training set. The 4 AHP schemes, AHP 1 to AHP 4, differ in the size of the generated test set, which was set to 50, 100, 200 and 400, respectively. With the WA combiner, the original training set was selected for the calculation of the weights.
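One possible reading of this procedure is sketched below. Several details are assumptions of the sketch and are not stated in the text: each pair of networks is compared on its own bootstrap sample, add-one smoothing guards against a zero error count, the weights are obtained by power iteration, and the per-pattern correctness flags are synthetic.

```python
import random


def pairwise_matrix(correct_flags, sample_size, rng):
    """Build a reciprocal comparison matrix from bootstrapped error quotients.

    correct_flags[j][k] is True when classifier j labels pattern k correctly.
    """
    n = len(correct_flags)
    n_patterns = len(correct_flags[0])
    a = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Bootstrap: draw pattern indices with replacement.
            idx = [rng.randrange(n_patterns) for _ in range(sample_size)]
            # Add-one smoothing (assumption of this sketch) avoids division by zero.
            e_i = (sum(not correct_flags[i][k] for k in idx) + 1) / (sample_size + 2)
            e_j = (sum(not correct_flags[j][k] for k in idx) + 1) / (sample_size + 2)
            a[i][j] = e_j / e_i
            a[j][i] = e_i / e_j
    return a


def principal_eigen(matrix, iterations=200):
    """Power iteration; returns (lambda_max, weights summing to 1)."""
    n = len(matrix)
    w = [1.0 / n] * n
    lam = float(n)
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        lam = sum(v)
        w = [x / lam for x in v]
    return lam, w


# Synthetic correctness flags for three classifiers on 200 training patterns
# (error rates of 5%, 25% and 12.5%).
flags = [[k % period != 0 for k in range(200)] for period in (20, 4, 8)]

matrix = pairwise_matrix(flags, sample_size=100, rng=random.Random(0))
lam_max, weights = principal_eigen(matrix)
ci = (lam_max - 3) / (3 - 1)  # consistency index of the sampled matrix
```

Because every pair is scored on a different random sample, the matrix is reciprocal but generally not consistent, so the resulting weights need not coincide with those of weighted averaging.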

         Set1    Set2    Set3    Set4    Set5
SA       8.52    9.26    9.95    9.77    8.80
WA       8.66    9.21    9.91    9.81    8.75
AHP 1    8.66    8.94    9.44   10.05    8.80
AHP 2    8.56    9.12    9.72   10.19    8.80
AHP 3    8.61    9.21    9.49    9.58    8.70
AHP 4    8.61    9.31    9.55    9.07    8.70

Table 3. Classification errors [%] on the Speech database (Error without combination: 12.92%)

         Set1    Set2    Set3    Set4    Set5
SA       8.70    8.34    7.88    7.26    7.78
WA       7.56    7.64    7.64    7.06    7.68
AHP 1    6.84    7.26    7.04    6.76    7.48
AHP 2    6.74    6.90    6.98    6.82    7.56
AHP 3    6.67    6.94    6.96    6.80    7.58
AHP 4    6.78    7.00    6.88    6.82    7.54

Table 4. Classification errors [%] on the Letter database (Error without combination: 13.78%)

4.3 Results and Discussion

Tables 3, 4, and 5 show the results of the experiments. Columns represent the various

classiﬁer sets, while rows show the classiﬁcation errors measured using the selected

combination of the corresponding classiﬁer group.

As expected, all the combinations here improved the generalization performance of the single classifier. In almost every case AHP-based combinations outperformed the weighted averaging combinations. However, in some cases SA performed better than WA and the AHP combiners, showing that the strong assumptions of the method are not always satisfied [10].

The performance of AHP combiners depends on the size of testing set. When the

test sets selected were too small, the measured accuracy values did not characterize the

goodness of the classiﬁers, and yielded wrong combination results. Increasing the size,

however, makes the consistency index (CI) tend to zero, producing weights that tend

to the values calculated by weighted averaging. Determining the optimal size of the test

set requires further study.

When considering the sensitivity to the selection of different classifier subsets, the AHP-based combiner behaves similarly to the WA method, hence the optimal classifier set can be selected by methods available for the averaging combiners [6].

Overall, the AHP-based combination scheme lowered the classification error relative to the uncombined classifier on all three databases and, in most cases, relative to the averaging combiners as well.

         Set1    Set2    Set3    Set4    Set5
SA      10.95   10.35   10.35   10.00   10.05
WA      10.50   10.35   10.45    9.90   10.00
AHP 1   10.05    9.95   10.05    9.60    9.30
AHP 2   10.00    9.70    9.45    9.50    9.50
AHP 3    9.80    9.50    9.50    9.20    9.45
AHP 4    9.90    9.55    9.75    9.30    9.50

Table 5. Classification errors [%] on the Satimage database (Error without combination: 12.05%)

5 Conclusion

In this paper we proposed a new linear combination method, and compared its per-

formance with those of other combiners. As shown in the experiments, AHP-based

combinations proved an effective generalization of the weighted averaging rule; they

outperformed the other averaging methods in almost every case.

Finally, we should mention that resampling techniques may further improve the

performance of linear combinations, hence it is worth investigating generalizations of

the bootstrapped learners Bagging [11] and Boosting [12] in future work.

References

1. Anil K. Jain, Robert P. W. Duin, and Jianchang Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000.

2. V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.

3. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, New York, 2001.

4. L. Xu, A. Krzyzak, and C. Y. Suen. Method of combining multiple classifiers and their application to handwritten numeral recognition. IEEE Trans. on SMC, 22(3):418–435, 1992.

5. K. Tumer and J. Ghosh. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition, 29:341–348, 1996.

6. F. Roli and G. Fumera. Analysis of linear and order statistics combiners for fusion of imbalanced classifiers. In 3rd Int. Workshop on Multiple Classifier Systems (MCS 2002), Cagliari, Italy, June 2002. Springer-Verlag, LNCS.

7. M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In R. J. Mammone, editor, Neural Networks for Speech and Image Processing, pages 126–142. Chapman-Hall, 1993.

8. T. L. Saaty. The Analytic Hierarchy Process. McGraw-Hill, New York, 1980.

9. A. Kocsor and L. Tóth. Kernel-based feature extraction with a speech technology application. IEEE Transactions on Signal Processing, 2003.

10. G. Fumera and F. Roli. Linear combiners for classifier fusion: Some theoretical and experimental results. In 4th Int. Workshop on Multiple Classifier Systems (MCS 2003), Guildford, UK, January 2003. Springer-Verlag, LNCS.

11. Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

12. Y. Freund. Boosting a weak learning algorithm by majority. In Proceedings of the Workshop on Computational Learning Theory (COLT 1990). Morgan Kaufmann Publishers, 1990.