Learning on Vertically Partitioned Data based on Chi-square Feature

Selection and Naive Bayes Classiﬁcation

Ver´onica Bol´on-Canedo, Diego Peteiro-Barral, Amparo Alonso-Betanzos, Bertha Guijarro-Berdi˜nas

and Noelia S´anchez-Maro˜no

Department of Computer Science, University of A Coru˜na, Campus de Elvi˜na s/n, A Coru˜na 15071, Spain

Keywords:

Feature Selection, Classiﬁcation, Distributed Learning.

Abstract:

In the last few years, distributed learning has been the focus of much attention due to the explosion of big

databases, in some cases distributed across different nodes. However, the great majority of current selection

and classiﬁcation algorithms are designed for centralized learning, i.e. they use the whole dataset at once. In

this paper, a new approach for learning on vertically partitioned data is presented, which covers both feature

selection and classiﬁcation. The approach splits the data by features, and then uses the χ

ﬁlter and the

naive Bayes classiﬁer to learn at each node. Finally, a merging procedure is performed, which updates the

learned model in an incremental fashion. The experimental results on ﬁve representative datasets show that the

execution time is shortened considerably whereas the classiﬁcation performance is maintained as the number

of nodes increases.

1 INTRODUCTION

In the last decades, the dimensionality of the datasets

involved in data mining has increased dramatically

until the size of zetabytes. In fact, if one analyzes the

dimensionality of the datasets posted in the UC Irvine

Machine Learning Repository (Frank and Asuncion,

2010) from 1987 to 2010, this circumstance can be

conﬁrmed (Zhao and Liu, 2011). Dimensionality can

be deﬁned as the product of the number of samples

and features. In the 1980s, the maximal dimensional-

ity of the data was about 100; then in the 1990s, this

number increased to more than 1500; and ﬁnally in

the 2000s, it further increased to about 3 million. The-

oretically, having such a high amount of data avail-

able could lead to better results, but this is not always

the case due to the so-called curse of dimensional-

ity (Bellman, 1966). This phenomenon happens when

the dimensionality increases and the time required for

training the machine learning algorithm on the data

increases exponentially. To overcome these problems,

feature selection is a well-known dimensionality re-

duction technique.

Feature selection consists of detecting the relevant

features and discarding the irrelevant and the redun-

dant ones (Guyon et al., 2006). A correct selection of

the features can lead to an improvement of the induc-

tive learner, either in terms of learning speed, gener-

alization capacity or simplicity of the induced model.

Feature selection, since it is an important activity in

data preprocessing, has been an active research area

in the last decade, ﬁnding success in many different

real world applications (Bol´on-Canedo et al., 2011;

Forman, 2003; Saari et al., 2011; Saeys et al., 2007).

Most of the feature selection methods belong to

one of the three following types: ﬁlters, wrappers and

embedded methods. While wrapper models involve

optimizing a predictor as part of the selection process,

ﬁlter models rely on the general characteristics of the

training data to select features with independence of

any predictor. The embedded methods generally use

machine learning models for classiﬁcation, and then

an optimal subset of features is built by the classi-

ﬁer algorithm. Nevertheless, the most common ap-

proach in feature selection is the ﬁlter model and will

be the approach selected for this research, As stated

in (Saeys et al., 2007), even when the subset of fea-

tures is not optimal, ﬁlters are preferable due to their

computational and statistical scalability.

Most existing feature selection techniques are de-

signed to run in a centralized computing environ-

ment. Traditionally, it is assumed that all data can

be held in the memory or, at least, all data are stored

in one central storage space. However, in the con-

temporary world, huge databases are developed and

maintained in meteorological, ﬁnancial, medical, in-

350

Bolón-Canedo V., Peteiro-Barral D., Alonso-Betanzos A., Guijarro-Berdiñas B. and Sánchez-Maroño N..

Learning on Vertically Partitioned Data based on Chi-square Feature Selection and Naive Bayes Classiﬁcation.

DOI: 10.5220/0004759503500357

In Proceedings of the 6th International Conference on Agents and Artiﬁcial Intelligence (ICAART-2014), pages 350-357

ISBN: 978-989-758-015-4

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

dustrial or science domains (Czarnowski, 2011), and

traditional centralized techniques are not ﬁt to effec-

tively deal with such massive datasets. Not only is the

large size of the datasets the problem facing feature

selection, but also that the datasets may be geographi-

cally, physically or logically distributed. In both these

situations, distributed feature selection techniques are

often required. In this manner, allocating the learn-

ing process among several nodes is a natural way of

scaling up learning algorithms at the same time that

it allows to deal with datasets that are naturally dis-

tributed. There are two common types of data distri-

bution: (a) horizontal distribution, where data are dis-

tributed in subsets of instances; and (b) vertical dis-

tribution, where data are distributed in subsets of fea-

tures. The great majority of approaches distribute the

data horizontally (Chan et al., 1993; Ananthanarayana

et al., 2000; Tsoumakas and Vlahavas, 2002) when

datasets are too large for batch learning in terms of

samples. While not common, there are some other de-

velopments that distribute the data by features (Skil-

licorn and McConnell, 2008; McConnell and Skil-

licorn, 2004), which is appropriate when the number

of features is large. Besides of the type of partition,

another important issue when dealing with distributed

learning is privacy preserving. The goal of privacy

preserving is to learn valuable knowledge from dif-

ferent sources without leakage of any sensitive data,

in other words, share non-sensitive data to infer sen-

sitive data (Wang et al., 2009).

The idea of this research is to deal with distributed

learning problems through distributing vertically the

data and performing a feature selection process which

can be carried out at separate nodes. Since the compu-

tational complexity of most of feature selection meth-

ods is affected by the number of features, the com-

plexity in each node will be reduced with respect to

the centralized approach. Then, the selection pro-

cedure required for the data reduction will be inte-

grated with the classiﬁer learning. For the feature se-

lection step, we choose the χ

metric (Liu and Se-

tiono, 1995), because of its simplicity and effective-

ness. However, this ﬁlter requires data to be discrete,

so a discretization stage has to be added to prepro-

cess the data. Finally, a classiﬁer is necessary, and

the well-known naive Bayes (Rish, 2001) was cho-

sen. This decision has been made because after per-

forming the three stages in each node (discretization,

selection and classiﬁcation), the learned models are

combined in a incremental manner, and naive Bayes

has some characteristics that makes it inherently in-

cremental. With the proposed methodology, it is ex-

pected that the global learning process will be sped up

and so become more computationally efﬁcient.

The rest of the paper is organized as follows: sec-

tion 2 presents the state of the art in the ﬁeld of dis-

tributed learning, section 3 describes the method pro-

posed in this research, section 4 introduces the ex-

perimental settings and results and, ﬁnally, section 5

reveals the conclusions and future lines of research.

2 BACKGROUND AND RELATED

WORK

Most of the research in the literature concerning dis-

tributed learning proposes privacy-preserving meth-

ods for horizontally partitioned data. A meta-learning

approach has been developed that uses classiﬁers

trained at different sites to develop a global classi-

ﬁer (Prodromidis et al., 2000). In (Wolpert, 1992;

Chawla et al., 2002; Tsoumakas and Vlahavas, 2002;

Kantarcioglu and Clifton, 2004; Tsoumakas and Vla-

havas, 2009), the authors proposed several privacy-

preserving classiﬁcation methods for horizontally

partitioned data. However, previous research in dis-

tributed classiﬁcation for vertically partitioned data is

rather sparse. In (Vaidya and Clifton, 2005), it is in-

troduced a generalized privacy preserving variant of

the ID3 algorithm for vertically partitioned data dis-

tributed over two or more parties. In (Vaidya and

Clifton, 2004), the authors addressed classiﬁcation

over vertically partitioned data where either all the

parties hold the class attribute, or only a subset of

the parties have the class attribute. The basic idea

behind their protocol is that each party ends up with

shares of the conditionally independent probabilities

that constitute the parameters of a Na¨ıve Bayes clas-

siﬁer. In (Gangrade and Patel, 2013), all party cal-

culates probabilities (model parameters of a Na¨ıve

Bayes classiﬁer) of all class value for each attribute

value for every attribute individually, causing no pri-

vacy breaches. They use secure multiplication pro-

tocol for multiplying the probabilities of particular

values of all attributes for all class value and com-

pare total probability of all class value and ﬁnd out

the maximum total probability. Other research has

addressed classiﬁcation using Bayesian networks in

vertically partitioned data (Chen et al., 2001), and sit-

uations where the distribution is itself interesting with

respect to what is learned (Wirth et al., 2001).

On the other hand, regarding distributed feature

selection, a distributed privacy-preserving method to

perform feature subset selection that handles both

horizontal as well as vertical data partitioning is pro-

posed in (Banerjee and Chakravarty, 2011). In that

paper a secure distributed protocol was proposed.

It allows feature selection for multiple parties with-

LearningonVerticallyPartitionedDatabasedonChi-squareFeatureSelectionandNaiveBayesClassification

351

out revealing their own data evolving from a method

called virtual dimension reduction. This method is

used in the ﬁeld of hyperspectral image process-

ing for selection of subset of hyperspectral bands

for further analysis. In (Ye et al., 2010), the au-

thors addressed attribute reduction over vertically par-

titioned data, where two parties, each having a pri-

vate dataset want to collaboratively conduct global at-

tribute reduction. By using a semi-trusted third party

and commutative encryption, they presented some se-

cure multi-party computation (SMC) protocols (Yao,

1982) into privacy preserving attribute reduction al-

gorithms. But the method is only proven secure under

the semi-honest model, and security under this ad-

versary model is insufﬁcient. SMC protocols under

the malicious adversary model generally have imprac-

tically high complexities for privacy-preserving data

mining.

As can be seen, distributed classiﬁcation algo-

rithms are becoming more popular in machine learn-

ing. Moreover, some ﬁrst steps are taken towards de-

veloping distributed feature selection methods. How-

ever, to the best knowledge of the authors, none of the

previous research addresses both distributed feature

selection and classiﬁcation simultaneously on verti-

cally partitioned data.

3 THE PROPOSED METHOD

As stated before, distributed feature selection on ver-

tically partitioned data has not been deeply explored

yet. Distributed methods usually consist of three

stages:

1. Partition of the dataset (if the dataset is not dis-

tributed from origin).

2. Application of learning methods in each node. In

the case of the method proposed herein, it consists

of three steps:

(a) Discretization.

(b) Feature selection.

3. Combination of the results.

The interest of this work relies on the independence

of the methodology, that can be performed on all the

nodes at the same time. Besides, the novelty in the

combination stage is that it is done in an incremental

manner. As explained before, the learning methodol-

ogy to be applied to each node consists of three steps:

discretization, feature selection and classiﬁcation (see

Figure 1).

Figure 1: Flow chart of proposed methodology.

3.1 Partition of the Dataset

In some cases, data can be originally distributed by

features. In this manner, different features belonging

to the same sample are recorded in different locations.

Each node gathers the values for one or more features

for a given instance and then, each node has a differ-

ent “view” of the data. For instance, a sensor network

usually records a single feature in each sensor. An-

other example may be a patient that performs several

medical tests in different hospitals. In such these sit-

uations, a distributed learning approach can be much

more efﬁcient computationally than moving all dis-

tributed datasets into a centralized site for learning the

global model. Moreover,even when data are stored in

a single site, distributed learning can be also useful to

speed up the process.

As most of the datasets publicly available are

stored in a centralized manner, the ﬁrst step consists

of partitioning the dataset, i.e. dividing the original

dataset into several disjoint subsets of approximately

the same size that coverthe full dataset. As mentioned

in the introduction, in this research the partition will

be done vertically, as can be seen in Figure 2. Notice

that this step could be eliminated in a real distributed

situation.

3.2 Learning Methods

In this research, the learning stage consists of three

steps: discretizer, ﬁlter and classiﬁer, which will be

following described.

ICAART2014-InternationalConferenceonAgentsandArtificialIntelligence

352

Figure 2: Vertical partition of the data.

3.2.1 Discretizer

Many ﬁlter algorithms are shown to work effectively

on discrete data (Liu and Setiono, 1997), so dis-

cretization is a recommended as a previous step. The

well-known k-means discretization algorithm (Tou

and Gonz´alez, 1977; Ventura and Martinez, 1995)

was chosen because of its simplicity and effective-

ness. K-means moves the representative weights of

each cluster along an unrestrained input space, where

each feature is discretized independently, making it

suitable for our purposes. This clustering algorithm

operates on a set of data points and assumes that

the number of clusters to be determined (k) is given.

The partition is done based on certain objective func-

tion. The most frequently used criterion function in k-

means is minimizing the squared error ε between the

centroids µ

of clusters c

,i = 1,... ,k and the samples

in those clusters

ε =

∑

x∈c

|x− µ

Let C be the set of clusters and |C| its cardinal-

ity. For each new sample x, the discretizer works as

follows,

• If |C| < k and x /∈ C then C = {x} ∪C, i.e. if the

maximum number of cluster was not reached al-

ready and the new sample is not in C, then create

a new cluster with its centroid in x.

• else

1. Find the closest cluster to x.

2. Update its centroid µ as the average of all values

in that cluster.

The method assigns at most k clusters. Notice that

the number of clusters is the minimum between the

parameter k and the number of different values in the

feature.

3.2.2 Filter

The χ

method (Liu and Setiono, 1995) evaluates

features individually by measuring their chi-squared

statistic with respect to the class labels. The χ

value

of an attribute is deﬁned as:

∑

i=1

∑

j=1

− E

)

(1)

where

= R

∗ L

/S (2)

t being the number of intervals (number of different

values in a feature), l the number of class labels, A

the number of samples in the i-th interval, j-th class,

the number of samples in the i-th interval, L

the

number of samples in the j-th class, S the total num-

ber of samples, and E

the expected frequency of A

Note that the size of the matrices is related to the num-

ber of intervals. In this manner, a very large k in the

discretizer will lead to a very large size of the matri-

ces A and E. A very large matrix is computationally

expensive to update and this should be taken into ac-

count for real-time applications.

After calculating the χ

value of all considered

features in each node, these values can be sorted with

the largest one at the ﬁrst position, as the larger the χ

value, the more important the feature is. This will pro-

vide an ordered ranking of features, and a threshold

needs to be established. In this research, the choice is

to estimate the threshold from the effect on the train-

ing set, speciﬁcally using 10% of the training dataset

available at each node so as to speed up the process.

The selection of this threshold must take into account

two different criteria: the training error, e, and the

percentage of features retained, m. Both values must

be minimized to the extent possible. The ﬁtness func-

tion is showed in equation (3), in which the function

f(v) is calculated using those features for which the

value is above v.

f(v) = αe(v) + (1− α)m(v) (3)

α being a value in the interval [0,1] that measures the

relative relevance of both values. Following the rec-

ommendations in (de Haro Garc´ıa, 2011), a value of

α = 0.75 was chosen, since in general the error min-

imization is more important than storage reduction.

For the possible values of the threshold v, three op-

tions were considered for the experimental part:

LearningonVerticallyPartitionedDatabasedonChi-squareFeatureSelectionandNaiveBayesClassification

353

• v = mean(χ

)

• v = mean(χ

) + std(χ

)

• v = mean(χ

) + 2std(χ

)

3.2.3 Classiﬁer

Among the broad range of classiﬁers available in the

literature, the naive Bayes method (Rish, 2001) was

chosen for the classiﬁcation step. This classiﬁer is

simple, efﬁcient and robust to noise and irrelevant at-

tributes. Besides, it requires a small amount of input

data to estimate the necessary parameters for classiﬁ-

cation.

Given a set of l mutually exclusive and exhaustive

classes c

,. .. , c

, which have prior probabilities

P(c

),P(c

),... ,P(c

), respectively, and n attributes

,. .. , a

which for a given instance have values

,. .. , v

respectively, the posterior probability of

class c

occurring for the speciﬁed instance can be

shown to be proportional to

P(c

) × P(a

= v

and a

= v

.. . and a

= v

)

(4)

Making the assumption that the attributes are inde-

pendent, the value of this expression can be calculated

using the product

P(c

)×P(a

= v

)×P(a

= v

)×···×P(a

= v

)

(5)

This product is calculated for each value of i from

1 to l and the class which has the largest value is cho-

sen. Notice that this method is suitable for a dynamic

space of input features.

3.3 Combination of the Results

After performing the previous stages, the methodol-

ogy will return as many trained classiﬁers as nodes we

have. These classiﬁers are trained using only the fea-

tures selected in each node. The ﬁnal step consists of

combining all the trained classiﬁers in an incremental

manner, in order to have a unique classiﬁer trained on

the subset of features resulted of the union of the fea-

tures selected in every node. This combination is pos-

sible because the naive Bayes classiﬁer is inherently

incremental. In this algorithm each feature makes an

independent contribution towards the prediction of a

class as stated in the previous section.

Notice that the main contribution of this paper re-

lies in this merging step. The formulation of the naive

Bayes classiﬁer allows to build an exact solution, i.e.

the same as would be obtained in batch learning. For

this reason, the solution achieved is more reliable

than other schemes, such as voting. Moreover, this

methodology is ﬂexible, since it can work indepen-

dently of the number of nodes, the number of features

selected and so on.

4 EXPERIMENTAL STUDY

4.1 Materials

Table 1 summarizes the number of input features,

samples, and output classes of the data sets. A more

detailed description of the data sets can be found in

(Frank and Asuncion, 2010).

Table 1: Brief description of the binary data sets.

Name Features Training Test

samples samples

Madelon 500 1,600 800

MNIST 717 40,000 20,000

Mushrooms 112 5,416 2,708

Ozone 72 1,691 845

Spambase 52 3,068 1,533

4.2 Performance Metrics

In addition to the traditional approach of evaluating

the performance of an algorithm in terms of test accu-

racy, a distributed algorithm can be also evaluated in

terms of speed-up (Bramer, 2013). Speed-up exper-

iments evaluate the performance of the system with

respect to the number of nodes for a given dataset.

We measure the training time as the number of nodes

is increased. This shows how much a distributed al-

gorithm is faster than the serial (one processor) ver-

sion, as the dataset is distributed to more and more

nodes. We can deﬁne two performance metrics asso-

ciated with speep-up:

• The speedup factor S

is deﬁned by S

where R

and R

are the training times of the

algorithm on a single and n nodes, respectively.

This factor measures how much the training time

is faster using n nodes instead of just one. The

ideal case is that S

= n, but the usual situation

is that S

< n because of communication or other

overheads. Occasionally, it can be a value greater

than n, in the case of what is known as superlinear

speedup.

• The efﬁciency E

of using n nodes instead of one

is deﬁned by E

, i.e. the speedup factor di-

vided by the number of nodes.

ICAART2014-InternationalConferenceonAgentsandArtificialIntelligence

354

4.3 Experimental Procedure

The evaluation of the methods has been done using

holdout validation. Training data have been scattered

across either 2, 4, or 8 nodes; 1 node was also con-

sidered to perform a comparative study with the stan-

dard centralized approach. Experiments were run 10

times with random partitions of the datasets in order

to ensure reliable results. We use the methodology

proposed by Demsar (Demˇsar, 2006) to perform a sta-

tistical comparison of the algorithms over the multi-

ple data sets. First a Friedman test (Friedman, 1940)

is done and then, if signiﬁcant differences are found,

the Bonferroni-Dunn test (Dunn, 1961) is considered.

4.4 Results

Table 2 shows the training time of the algorithm for

the different datasets and number of nodes. As can be

seen, the training time is dramatically reduced as the

number of nodes is increased. Statistical tests demon-

strate that doubling the number of nodes obtains sig-

niﬁcantly better results in terms of time.

Table 2: Training time (s).

Number of nodes

1 2 4 8

Madelon 127.74 63.96 31.89 16.18

MNIST 4643.43 2336.79 1165.28 585.93

Mushrooms 94.28 46.96 23.58 11.82

Ozone 19.95 9.96 4.99 2.50

Spambase 28.20 14.14 7.10 3.57

Table 3 shows the test accuracy on the different

datasets for the different number of nodes. In gen-

eral terms, the accuracy is maintained as the number

of nodes in increased. Statistical tests prove this fact.

However, this is not the case of MNIST dataset. The

accuracy of the algorithm on this dataset when using

8 nodes is signiﬁcantly worse in comparison with its

performance when using 1 (batch), 2, or 4 nodes. This

dataset has a large number of classes, so the less fea-

tures in each node, the more difﬁcult to ﬁnd a cor-

relation between them and the classes. For this kind

of multiclass datasets, it seems that it is necessary a

more exhaustive experimentation to ﬁnd the optimal

number of nodes.

Finally, Figure 3 shows three graphs representing

the different measures related with the time perfor-

mance of the algorithm. Figure 3(a) plots the train-

ing time versus the number of nodes. Figure 3(b)

shows a graph of speedup factor against the number of

nodes. This form of display is often preferred to the

more obvious plot of training time versus the num-

ber of nodes, as it makes straightforward to see the

Table 3: Test accuracy (%).

Number of nodes

1 2 4 8

Madelon 71.38 71.27 71.05 70.82

MNIST 76.26 76.31 76.36 67.06

Mushrooms 93.25 92.04 91.81 91.78

Ozone 86.15 85.93 85.76 85.73

Spambase 88.78 88.72 88.76 88.81

impact on the training time of increasing the num-

ber of nodes. As can be deduced from Figure 3(c),

the efﬁciency of the proposed method is close to 1,

i.e. increasing the number of nodes by n divides the

training time by the same n. Notice the implications

of these results when dealing with high dimensional

datasets. The training time may be notably reduced

without compromising the classiﬁcation accuracy. In

this manner,the proposed methodology allows to deal

with problems which were intractable with classical

approaches.

5 CONCLUSIONS

In this work, a new method for scaling up feature se-

lection was proposed. The idea was to vertically dis-

tribute the data and then performing a feature selec-

tion process which could be carried out at separate

nodes. Thus, the proposed methodology consists of

three different steps: discretization of the data, fea-

ture selection, and classiﬁcation. All the stages are

executed in parallel and ﬁnally, the learned models

obtained from each node are combined in an incre-

mental manner. For this reason, the classical naive

Bayes classiﬁer was modiﬁed so as to be able to make

it work incrementally.

The proposed methodology has been tested on ﬁve

datasets considered representative of problems from

medium to large size, and different numbers of nodes

to distribute the data were considered. As expected,

the larger the number of nodes, the shorter the time

required for the computation. However, in most of

the datasets, increasing the number of the nodes did

not lead to a signiﬁcative degradation in classiﬁcation

accuracy.

As future work, we plan to continue this research

using other feature selection algorithms and classi-

ﬁers, and trying another distributed approach where

all nodes share their results after each step (discretiza-

tion, feature selection and classiﬁcation). In this

sense, the difﬁcult of this future line of research lies

on the fact that for this approach, all the methods have

to be adapted to work in an incremental fashion.

LearningonVerticallyPartitionedDatabasedonChi-squareFeatureSelectionandNaiveBayesClassification

355

1 2 3 4 5 6 7 8

Number of nodes

Time (s)

Madelon MNIST Mushrooms Ozone Spambase

(a) Nodes vs time

1 2 3 4 5 6 7 8

Number of nodes

Speedup factor

Madelon MNIST Mushrooms Ozone Spambase

(b) Speed up

1 2 3 4 5 6 7 8

0.9

0.95

1.05

1.1

Number of nodes

Efficiency

Madelon MNIST Mushrooms Ozone Spambase

Figure 3: Plots regarding time performance of the algorithm.

ACKNOWLEDGEMENTS

This research has been partially funded by the Sec-

retar´ıa de Estado de Investigaci´on of the Spanish

Government and FEDER funds of the European

Union through the research project TIN 2012-37954.

Ver´onica Bol´on-Canedo and Diego Peteiro-Barral ac-

knowledge the support of Xunta de Galicia under

Plan I2C Grant Program.

REFERENCES

Ananthanarayana, V., Subramanian, D., and Murty, M.

(2000). Scalable, distributed and dynamic mining

of association rules. High Performance Computing

HiPC 2000, pages 559–566.

Banerjee, M. and Chakravarty, S. (2011). Privacy preserv-

ing feature selection for distributed data using virtual

dimension. In Proceedings of the 20th ACM interna-

tional conference on Information and knowledge man-

agement, CIKM ’11, pages 2281–2284. ACM.

Bellman, R. (1966). Dynamic programming. Science,

153(3731):34–37.

Bol´on-Canedo, V., S´anchez-Maro˜no, N., and Alonso-

Betanzos, A. (2011). Feature selection and classiﬁ-

cation in multiple class datasets: An application to

kdd cup 99 dataset. Expert Systems with Applications,

38(5):5947–5957.

Bramer, M. (2013). Dealing with large volumes of data. In

Principles of Data Mining, pages 189–208. Springer.

Chan, P., Stolfo, S., et al. (1993). Toward parallel and dis-

tributed learning by meta-learning. In AAAI workshop

in Knowledge Discovery in Databases, pages 227–

240.

Chawla, N. V., Hall, L. O., Bowyer, K. W., Moore Jr, T.,

and Kegelmeyer, W. P. (2002). Distributed pasting of

small votes. In Multiple Classiﬁer Systems, pages 52–

61. Springer.

Chen, R., Sivakumar, K., and Kargupta, H. (2001). Dis-

tributed web mining using bayesian networks from

multiple data streams. In Data Mining, 2001. ICDM

2001, Proceedings IEEE International Conference on,

pages 75–82. IEEE.

Czarnowski, I. (2011). Distributed learning with data re-

duction. In Transactions on computational collective

intelligence IV, pages 3–121. Springer-Verlag.

de Haro Garc´ıa, A. (2011). Scaling data mining algorithms.

Application to instance and feature selection. PhD

thesis, Universidad de Granada.

Demˇsar, J. (2006). Statistical comparisons of classiﬁers

over multiple data sets. The Journal of Machine

Learning Research, 7:1–30.

Dunn, O. J. (1961). Multiple comparisons among

ICAART2014-InternationalConferenceonAgentsandArtificialIntelligence

356

means. Journal of the American Statistical Associa-

tion, 56(293):52–64.

Forman, G. (2003). An extensive empirical study of feature

selection metrics for text classiﬁcation. The Journal

of Machine Learning Research, 3:1289–1305.

Frank, A. and Asuncion, A. (2010). UCI machine learning

repository.

Friedman, M. (1940). A comparison of alternative tests of

signiﬁcance for the problem of m rankings. The An-

nals of Mathematical Statistics, 11(1):86–92.

Gangrade, A. and Patel, R. (2013). Privacy preserving

three-layer na¨ıve bayes classiﬁer for vertically parti-

tioned databases. Journal of Information and Com-

puting Science, 8(2):119–129.

Guyon, I., Gunn, S., Nikravesh, M., and Zadeh, L. (2006).

Feature extraction: foundations and applications, vol-

ume 207. Springer.

Kantarcioglu, M. and Clifton, C. (2004). Privacy-

preserving distributed mining of association rules on

horizontally partitioned data. Knowledge and Data

Engineering, IEEE Transactions on, 16(9):1026–

1037.

Liu, H. and Setiono, R. (1995). Chi2: Feature selection

and discretization of numeric attributes. In Tools with

Artiﬁcial Intelligence, 1995. Proceedings., Seventh In-

ternational Conference on, pages 388–391. IEEE.

Liu, H. and Setiono, R. (1997). Feature selection via dis-

cretization. Knowledge and Data Engineering, IEEE

Transactions on, 9(4):642–645.

McConnell, S. and Skillicorn, D. (2004). Building predic-

tors from vertically distributed data. In Proceedings of

the 2004 conference of the Centre for Advanced Stud-

ies on Collaborative research, pages 150–162. IBM

Press.

Prodromidis, A., Chan, P., and Stolfo, S. (2000). Meta-

learning in distributed data mining systems: Issues

and approaches. Advances in distributed and paral-

lel knowledge discovery, 3.

Rish, I. (2001). An empirical study of the naive bayes clas-

siﬁer. In IJCAI 2001 workshop on empirical methods

in artiﬁcial intelligence, volume 3, pages 41–46.

Saari, P., Eerola, T., and Lartillot, O. (2011). Generaliz-

ability and simplicity as criteria in feature selection:

application to mood classiﬁcation in music. Audio,

Speech, and Language Processing, IEEE Transactions

on, 19(6):1802–1812.

Saeys, Y., Inza, I., and Larra˜naga, P. (2007). A review of

feature selection techniques in bioinformatics. Bioin-

formatics, 23(19):2507–2517.

Skillicorn, D. and McConnell, S. (2008). Distributed pre-

diction from vertically partitioned data. Journal of

Parallel and Distributed computing, 68(1):16–36.

Tou, J. and Gonz´alez, R. (1977). Pattern recognition prin-

ciples. Addison-Wesley.

Tsoumakas, G. and Vlahavas, I. (2002). Distributed data

mining of large classiﬁer ensembles. In Proceedings

Companion Volume of the Second Hellenic Confer-

ence on Artiﬁcial Intelligence, pages 249–256.

Tsoumakas, G. and Vlahavas, I. (2009). Distributed data

mining. Database Technologies: Concepts, Method-

ologies, Tools, and Applications, 1:157.

Vaidya, J. and Clifton, C. (2004). Privacy preserving na¨ıve

bayes classiﬁer for vertically partitioned data. In 2004

SIAM International Conference on Data Mining, Lake

Buena Vista, Florida, pages 522–526.

Vaidya, J. and Clifton, C. (2005). Privacy-preserving deci-

sion trees over vertically partitioned data. In Data and

Applications Security XIX, pages 139–152. Springer.

Ventura, D. and Martinez, T. (1995). An empirical com-

parison of discretization methods. In Proceedings of

the Tenth International Symposium on Computer and

Information Sciences, pages 443–450.

Wang, J., Luo, Y., Zhao, Y., and Le, J. (2009). A survey on

privacy preserving data mining. In Database Technol-

ogy and Applications, 2009 First International Work-

shop on, pages 111–114. IEEE.

Wirth, R., Borth, M., and Hipp, J. (2001). When distribu-

tion is part of the semantics: A new problem class for

distributed knowledge discovery. In Proceedings of

the PKDD 2001 workshop on ubiquitous data mining

for mobile and distributed environments, pages 56–64.

Citeseer.

Wolpert, D. H. (1992). Stacked generalization. Neural net-

works, 5(2):241–259.

Yao, A. C. (1982). Protocols for secure computations. In

Proceedings of the 23rd Annual Symposium on Foun-

dations of Computer Science, pages 160–164.

Ye, M., Hu, X., and Wu, C. (2010). Privacy preserving

attribute reduction for vertically partitioned data. In

Artiﬁcial Intelligence and Computational Intelligence

(AICI), 2010 International Conference on, volume 1,

pages 320–324. IEEE.

Zhao, Z. and Liu, H. (2011). Spectral Feature Selection for

Data Mining. Chapman & Hall/Crc Data Mining and

Knowledge Discovery. Taylor & Francis Group.

LearningonVerticallyPartitionedDatabasedonChi-squareFeatureSelectionandNaiveBayesClassification

357