Speeding up Support Vector Machines

Probabilistic versus Nearest Neighbour Methods for Condensing Training Data

Moïri Gamboni, Abhijai Garg, Oleg Grishin, Seung Man Oh, Francis Sowani,

Anthony Spalvieri-Kruse, Godfried T. Toussaint and Lingliang Zhang

Faculty of Science, New York University Abu Dhabi, P.O. Box 129188, Abu Dhabi, U.A.E.

Keywords: Machine Learning, Data Mining, Support Vector Machines, SMO, Training Data Condensation, k-nearest

Neighbour Methods, Blind Random Sampling, Guided Random Sampling, Wilson Editing, Gaussian

Condensing.

Abstract: Several methods for reducing the running time of support vector machines (SVMs) are compared in terms

of speed-up factor and classification accuracy using seven large real world datasets obtained from the UCI

Machine Learning Repository. All the methods tested are based on reducing the size of the training data that

is then fed to the SVM. Two probabilistic methods are investigated that run in linear time with respect to the

size of the training data: blind random sampling and a new method for guided random sampling (Gaussian

Condensing). These methods are compared with k-Nearest Neighbour methods for reducing the size of the

training set and for smoothing the decision boundary. For all the datasets tested blind random sampling gave

the best results for speeding up SVMs without significantly sacrificing classification accuracy.

1 INTRODUCTION

One of the most attractive learning machine models

for pattern recognition applications, from the point

of view of high classification accuracy, appears to be

the Support Vector Machine (SVM) (Vapnik, 1995).

There exists empirical evidence that SVMs yield

lower rates of misclassification than even the

classical k-Nearest Neighbour rule (Toussaint and

Berzan, 2012), in spite of the fact that (at least in

theory) the latter is asymptotically Bayes optimal for

all underlying probability distributions (Devroye,

1981). The drawback of SVMs is their worst-case

complexity, which is O(N^3), where N is the number

of instances in the training set, so that for very large

datasets the training time may become prohibitive

(Bordes, Ertekin, Weston, and Bottou, 2005). Therefore

much effort has been devoted to finding ways to

speed up SVMs (Almeida, Braga, and Braga, 2000;

Chen and Chen, 2002; Panda, Chang, and Wu, 2006;

Wang, Zhou, Huang, Liang, and Yang, 2006; Chen

and Liu, 2011; Li, Cervantes and Yu, 2012; Liu,

Beltran, Mohanchandra and Toussaint, 2013; Chen,

Zhang, Xue, and Liu, 2013). The simplest approach

is to select a small random sample of the data for

training (Lee and Mangasarian, 2001). This

approach may be trivially implemented in O(N)

worst-case time. Here this method is called blind

random sampling because it uses no information

about the underlying structure of the data. Non-blind

random sampling techniques such as Progressive

Sampling (PS) and Guided Progressive Sampling

(GPS) have also been investigated with some

success (Provost, Jensen and Oates, 1999; Ng and

Dash, 2006; Portet, Gao, Hunter and Quiniou, 2007).

Non-random sampling methods attempt to use

intelligent data analysis such as genetic algorithms

(Kawulok and Nalepa, 2012) or proximity graphs

(Toussaint and Berzan, 2012; Liu, Beltran,

Mohanchandra and Toussaint, 2013) to preselect a

supposedly better representative subset of the

training data, which is then fed to the SVM, in lieu

of the large original set of data. However, the use of

guided data condensation methods usually incurs an

additional worst-case cost of O(N log N) to O(N^3).

Since 1968 the literature contains a plethora of such

algorithms and heuristics of varying degrees of

computational complexity, for preselecting small

subsets of the training data that will perform well

under a variety of circumstances (Hart, 1968;

Sriperumbudur and Lanckriet, 2007; Toussaint, 2005).

Although such techniques naturally speed up the

training phase of the SVMs, by virtue of the smaller

Gamboni M., Garg A., Grishin O., Man Oh S., Sowani F., Spalvieri-Kruse A., T. Toussaint G. and Zhang L..

Speeding up Support Vector Machines - Probabilistic versus Nearest Neighbour Methods for Condensing Training Data.

DOI: 10.5220/0004927003640371

In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM-2014), pages 364-371

ISBN: 978-989-758-018-5

© 2014 SCITEPRESS (Science and Technology Publications, Lda.)

size of the training data, many studies primarily

focus on and report only the number of support

vectors retained, ignoring the additional time taken

to perform the pre-selection. Indeed, it has been

shown empirically, for methods that used proximity

graphs for training data condensation, that if the

additional time taken by the pre-selection step is

taken into consideration, the overall training time is

generally much worse than that of simple blind

random sampling (Liu, Beltran, Mohanchandra and

Toussaint, 2013).

Some hybrid methods that combine blind random

sampling with structured search for good

representatives of datasets have also been tried. An

original method combining blind random sampling

with SVM and near neighbour search, has recently

been suggested by Li, Cervantes and Yu (2012).

Their approach first uses blind random sampling to

select a small subset of the data, from which the

support vectors are extracted using a preliminary

SVM. These support vectors are then used to select

points from the original training set (the data

recovery step) that are near the preliminary support

vectors, thus yielding the condensed training set on

which the final SVM is applied.

In this paper several methods for speeding up the

running time of support vector machines (SVMs) are

compared in terms of the speed-up factor and the

classification accuracy, using seven large real world

datasets taken from the University of California at

Irvine Machine Learning Repository (Bache and

Lichman, 2013). All the methods are based on

efficiently reducing the size of the training data that

is subsequently fed to the SVM (SMO). Two

probabilistic methods are investigated that run in

O(N) worst-case time. The first uses blind random

sampling and the second is a new method proposed

here for guided random sampling (Gaussian

Condensing). These methods are also compared with

the standard leading competitor, the k-Nearest

Neighbour rule, as well as nearest neighbour

methods for reducing the size of the training set (k-

NN condensing), and for smoothing the decision

boundary (Wilson editing), both of which run in

O(N^2) worst-case time.

2 THE CLASSIFIERS TESTED

2.1 Blind Random Sampling

Blind random sampling is the simplest method for

reducing the size of the training set, both

conceptually and computationally, with a running

time of O(N). Its possible drawback, in theory, is

that it is blind with respect to the quality of the

resulting reduced training set, although this need not

result in poor performance. In the experiments

reported here the percentages of training data

randomly selected for training the SVM (SMO) were

varied from 10% to 90% in increments of 10%.
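As an illustration only, blind random sampling can be sketched in a few lines of Python. This is not the Weka/Java code used in the experiments; the function name blind_random_sample, the keep_fraction parameter, and the fixed seed are assumptions introduced here for the example.

```python
import random


def blind_random_sample(instances, labels, keep_fraction, seed=0):
    """Return a uniformly random subset of the training data.

    Hypothetical helper: the selection is a single pass over the data,
    so the expected cost is linear in the number of instances N.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in (0, 1]")
    n = len(instances)
    if n == 0:
        return [], []
    n_keep = max(1, round(n * keep_fraction))
    rng = random.Random(seed)
    keep = set(rng.sample(range(n), n_keep))  # indices to retain
    kept_x = [x for i, x in enumerate(instances) if i in keep]
    kept_y = [c for i, c in enumerate(labels) if i in keep]
    return kept_x, kept_y


if __name__ == "__main__":
    # Toy data; the sweep mirrors the 10% to 90% removal settings.
    X = [[float(i)] for i in range(100)]
    y = [i % 2 for i in range(100)]
    for removed in range(10, 100, 10):
        x_small, y_small = blind_random_sample(X, y, 1 - removed / 100)
        print(removed, len(x_small))
```

The sample is drawn without looking at the labels or feature values, which is what makes the method blind.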

2.2 Wilson Editing (Smoothing)

Wilson’s editing algorithm was used for smoothing

the decision boundary (Wilson, 1973). Each instance

X in the training set is classified using the 3-Nearest

Neighbour rule (the three nearest neighbours of X,

not including itself). Classification is done by means

of a majority vote. If the instance is misclassified it

is marked. After all instances have been classified,

all the marked points are deleted. This condensed set

is then used to classify the testing set. Wilson editing

was not designed to significantly reduce the size of

the training set; its goal is rather to improve

classification accuracy, and it is used here as a

pre-processing step before reducing the training set

further with methods tailored for that purpose. Since

k = 3, Wilson smoothing runs in O(N^2) worst-case

time using software packages available in the Weka

Machine Learning Software (Witten and Frank,

2000), and a straightforward naïve implementation.
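The editing procedure described above can be sketched as follows. This is a minimal naive version written for illustration, not the Weka implementation used in the experiments. The helper names, the use of squared Euclidean distance, and the way ties in the vote are resolved (first label encountered among the most frequent) are assumptions.

```python
import heapq
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance (assumed metric)."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def wilson_edit(instances, labels, k=3):
    """Delete every instance misclassified by its k nearest neighbours.

    Each instance is compared with all others, so the cost is
    quadratic in the number of instances for fixed k.
    """
    n = len(instances)
    marked = set()
    for i in range(n):
        neighbours = heapq.nsmallest(
            k,
            (j for j in range(n) if j != i),  # the instance itself is excluded
            key=lambda j: squared_distance(instances[i], instances[j]),
        )
        if not neighbours:
            continue  # a single instance has no neighbours to vote
        votes = Counter(labels[j] for j in neighbours)
        predicted = votes.most_common(1)[0][0]  # majority vote
        if predicted != labels[i]:
            marked.add(i)
    # Deletion happens only after all instances have been classified.
    kept = [i for i in range(n) if i not in marked]
    return [instances[i] for i in kept], [labels[i] for i in kept]
```

Marking first and deleting afterwards matters: every instance is judged against the full original training set, not against a set that shrinks during the pass.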

2.3 k-nearest-Neighbour Condensation

When all k nearest neighbours of a point X belong to

the class of X, the k-NN rule makes a decision with

very high confidence. In other words the point X is

surrounded by close data points from its own class,

and is therefore located relatively far from the

decision boundary. This suggests that many points

with this property could be safely deleted. Before

classifying each testing set, the corresponding

training set is condensed as follows. Each instance in

the training set is classified using the k-NN rule (not

including itself). If the instance is correctly

classified with very high confidence it is marked.

After all instances are classified, all marked points

are deleted. This condensed set is then used to

classify the testing set. High confidence in the

classification of X is measured by the proportion of

the k nearest neighbours of X that belong to the class

of X. The standard k-NN rule uses a majority vote as

its measure of confidence. In our approach we use

the unanimity vote (all the k nearest neighbours

belong to the same class), and select a good value of

k. This algorithm runs in O(kN^2) worst-case time

using the naïve straightforward implementation and

packages available in Weka. Note that when data of

different classes are widely separated it may happen

(at least in theory) that for every point X its k nearest

neighbours all belong to the class of X. In such a

situation the unbridled k-NN condensation might

discard the entire training set. For such an

eventuality, if for some pattern class all training

instances are marked for deletion, the mean of those

instances is retained as the representative of that

class. Experiments were also performed with k-NN

condensation preceded by Wilson editing.
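A minimal sketch of this condensation step, including the fallback to the class mean, is given below for illustration. It is not the code used in the experiments: the function names, the squared Euclidean metric, and the default k = 3 are assumptions (the text only states that a good value of k is selected).

```python
import heapq


def squared_distance(a, b):
    """Squared Euclidean distance (assumed metric)."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def knn_condense(instances, labels, k=3):
    """Delete instances whose k nearest neighbours all share their class."""
    n = len(instances)
    marked = set()
    for i in range(n):
        neighbours = heapq.nsmallest(
            k,
            (j for j in range(n) if j != i),  # the instance itself is excluded
            key=lambda j: squared_distance(instances[i], instances[j]),
        )
        # Unanimity vote: mark only if every neighbour agrees with the label.
        if neighbours and all(labels[j] == labels[i] for j in neighbours):
            marked.add(i)
    kept_x = [instances[i] for i in range(n) if i not in marked]
    kept_y = [labels[i] for i in range(n) if i not in marked]
    # Fallback: a class that lost all its instances keeps their mean.
    for c in dict.fromkeys(labels):  # classes in order of first appearance
        members = [i for i in range(n) if labels[i] == c]
        if all(i in marked for i in members):
            dims = len(instances[members[0]])
            mean = [sum(instances[i][f] for i in members) / len(members)
                    for f in range(dims)]
            kept_x.append(mean)
            kept_y.append(c)
    return kept_x, kept_y
```

On two widely separated clusters every point is marked, and the output reduces to one mean vector per class, which is the situation the fallback is meant to handle.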

2.4 Gaussian Condensing

Gaussian Condensing is a novel heuristically guided

random sampling algorithm introduced here. The

heuristic implemented assumes that instances with

feature values relatively close to the mean of their

own class are likely to be furthest from the decision

boundary, and therefore not expected to contain

much discrimination information. Conversely, points

relatively far from the mean are likely to be closer to

the decision boundary, and expected to contain the

most useful information. First, for each class, the

mean value of each feature is calculated. Then, for

each feature in each instance, the ratio between the

Gaussian function of the mean, and the Gaussian

function of the feature value of that instance is

computed. This determines a parameter termed the

partial discarding probability. Finally, all instances

are discarded probabilistically in parallel with a

probability equal to the mean of the partial

discarding probabilities of all their features. The

main attractive attribute of this algorithm is that it

runs in O(N) worst-case time, where N is the number

of training instances. It is therefore linear with

respect to the size of the training data, and thus

much faster than previous discarding methods that

use proximity graphs, which are either quadratic or

cubic in N. Indeed, the complexity of Gaussian

Condensing is as low as that of blind random

sampling.

The goal of Gaussian Condensing is to invert the

probability distribution function of instances for all

features of each class. Hence, points near the mean

are certain to be thrown away, and points near the

boundaries are almost never thrown away. If applied

to data with a Gaussian distribution, the probability

distribution function would result in an inverted bell

curve, with the minimum point occurring at the

center, and increasing towards the boundaries before

decreasing again. A similar idea was introduced by

Chen, Zhang, Xue, and Liu (2013), with strong

results. However, their algorithm deletes a ratio of

the total data closest to the mean. The approach

proposed here is superior in two ways: (1) it does

not require a method to decide the ratio of data that

should be optimally kept, and (2) it does not create a

“hole” in the data, but rather preserves the entire

distribution of points, by simply altering the density.

Experiments were also performed with Gaussian

Condensing preceded by Wilson editing.
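One possible reading of Gaussian Condensing is sketched below. The text does not fix every detail, so several choices here are assumptions: the partial discarding probability is taken to be G(x)/G(mean) = exp(-(x - mean)^2 / (2 sigma^2)), which equals 1 at the class mean and decays away from it; sigma^2 is taken to be the per-class, per-feature variance; and a feature with zero variance contributes a partial probability of 1. The function names are hypothetical.

```python
import math
import random


def discard_probabilities(instances, labels):
    """Mean of the per-feature partial discarding probabilities."""
    by_class = {}
    for x, c in zip(instances, labels):
        by_class.setdefault(c, []).append(x)
    stats = {}
    for c, rows in by_class.items():
        n, dims = len(rows), len(rows[0])
        means = [sum(r[f] for r in rows) / n for f in range(dims)]
        variances = [sum((r[f] - means[f]) ** 2 for r in rows) / n
                     for f in range(dims)]
        stats[c] = (means, variances)
    probabilities = []
    for x, c in zip(instances, labels):
        means, variances = stats[c]
        partials = []
        for value, mu, var in zip(x, means, variances):
            if var == 0.0:
                partials.append(1.0)  # constant feature: value equals mean
            else:
                # Assumed ratio G(value) / G(mu) for a Gaussian with
                # mean mu and variance var.
                partials.append(math.exp(-((value - mu) ** 2) / (2.0 * var)))
        probabilities.append(sum(partials) / len(partials))
    return probabilities


def gaussian_condense(instances, labels, seed=0):
    """Discard each instance independently with its discarding probability."""
    rng = random.Random(seed)
    kept_x, kept_y = [], []
    for x, c, p in zip(instances, labels, discard_probabilities(instances, labels)):
        if rng.random() >= p:  # keep with probability 1 - p
            kept_x.append(x)
            kept_y.append(c)
    return kept_x, kept_y
```

Under these assumptions the work is a constant number of passes over the data, so the cost is linear in N for a fixed number of features, and an instance lying exactly at its class mean is always discarded.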

3 THE DATASETS TESTED

Wine Quality Data: The white wine quality dataset

includes over 2000 different vinho verde wines

(instances). The dataset comprises twelve features

that include acidity and sulphate content. There are

ten classes defined in terms of quality ratings that

vary between 1 and 10.

Year Prediction Million Song Data: The original

dataset is extremely large, (515,345 instances) and

therefore some of the data were randomly discarded.

The pattern classes were converted from years to

decades (1950s through 2000s) and then 3,000

instances of each class were chosen, comprising six

classes with a total of 18,000 instances.

Handwritten Digits Data: This dataset contains 32

by 32 bitmaps that have been obtained by centering

and normalising the input images from 43 different

people. The training set consists of 5,620 instances

and has data from 30 people, while the test set

comes from the 13 others, so as to prevent learning

algorithms from classifying digits based on the

writing style rather than features of the shape of the

digits themselves. To decrease the dimensionality of

the data, the bitmaps are divided into 4 by 4 blocks

and the number of pixels in each block is counted.

The total number of features is thus 63 and the

number of classes is 10, the digits 0 through 9.

Letter Image Data: This dataset contains black-

and-white rectangular pixel displays of the 26 upper-

case letters in the English alphabet. The letter

images were constructed from twenty different fonts.

Each letter from the twenty fonts was randomly

distorted to produce 20,000 unique instances. Each

instance is described using 17 attributes: a letter

category (A, B, C, …, Z) and 16 numeric features.

Wearable Computing Data: This dataset (PUC-

Rio) contains information matching accelerometer

readings from various parts of the human body, with

the readings taken while the actions were performed.

Accelerometers collected x, y, and z axes data from

the waist, left-thigh, right ankle, and right upper-arm

ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods

366

of the subjects. In each instance, the subjects were

either sitting-down, standing-up, walking, in the

process of standing, or in the process of sitting.

Metadata about the gender, age, height, weight and

BMI of each subject are also provided. In total

165,632 instances of such data are included in this

dataset.

Spambase data: The Spambase dataset provides

information about email spam. The emails are

classified into two categories: spam and non-spam.

The data labelled spam were collected from

postmasters and individuals who had reported spam,

and the non-spam data were collected from filed

work and personal emails. The dataset was created

with the goal of designing a personal spam filter. It

contains 4601 instances, of which 1813 (39.4%) are

spam. These instances are characterized by 58

attributes (57 continuous features and one nominal

class label). The class label is either 1 or 0,

indicating that the email is either spam or non-spam,

respectively.

The MAGIC Gamma Telescope data: These data

consist of Monte-Carlo generated simulations of

high-energy gamma particles. There are ten

attributes, each continuous, and two classes (‘g’ and

‘h’). The number of instances was 12,332 for ‘g’ and

6,688 for ‘h’. For the purpose of this study

approximately half of class ‘g’ was removed at

random, since the goal of the present research is the

improvement of the running time of SVMs, rather

than the minimization of the probability of

misclassification for this particular application.

4 RESULTS AND DISCUSSION

4.1 The Computation Platform

The timing experiments were performed on the

fastest high-performance computer available in the

United Arab Emirates (second fastest in the Gulf

region): BuTinah, operated by New York University

Abu Dhabi. The computer consists of 512 nodes,

each one equipped with 12 Intel Xeon X5675 CPUs

clocked at 3.07 GHz and 48 GB of RAM with 10 GB

of swap memory. BuTinah operates at approximately

70 trillion floating-point operations per second (70

teraflops). The experiments utilized seven nodes in

total, consuming 9 hours of computation time and

12 GB of memory. The testing environment was

programmed in Java, using the Weka Data Mining

Package, produced by the University of Waikato.

4.2 Blind Random Sampling

The SMO (Sequential Minimal Optimization)

version of SVM, invented by John Platt (1998) and

improved by Keerthi, Shevade, Bhattacharyya, and

Murthy (2001), that is installed in the Weka machine

learning package was compared to the classical k-

NN decision rule when both are preceded by blind

random removal of data before feeding the

remaining data to each classifier. A typical result

obtained with the Wearable Computing dataset is

shown in Figure 1, for the classification accuracy

(left vertical axis) and the total running time (right

vertical axis). Total time refers to the sum of the

times taken for training data condensation, training

time, and testing time (results for the three

individual timings will be presented in a following

section). In this and all other experiments the

classification accuracies and timings were obtained

by the method of K-fold cross-validation (or Π

method) with a value of K = 10 (Toussaint, 1974).

This means that for each of the classifiers and

condensing methods tested the procedure for

estimating the classification accuracy for each fold

was the following. Let {X} denote the entire dataset.

The ith fold is obtained by taking the ith 10% of {X}

as the testing set, denoted by {X_{TS-i}}, and the

remaining 90% of the data as the training set,

denoted by {X_{TR-i}}. Estimates of the

classification accuracy of any classifier are then

obtained by training the classifier on {X_{TR-i}}, and

testing it on {X_{TS-i}}, for i = 1, 2, …, 10, yielding a

total of ten estimates. Similarly, when estimating the

classification accuracy of an editing (or condensing)

method, the editing (or condensing) is first applied

to {X_{TR-i}}, and the resulting edited (condensed) set is

used to classify {X_{TS-i}}. Finally, in all cases the

average of the ten estimates obtained in this way is

calculated. Thus the results shown in the figures are

the mean values over the ten folds. This method also

permits the computation of standard deviations (over

the ten folds) to serve as indicators of statistically

significant differences between the means. The error

bars in the figures indicate ± one standard deviation.
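The evaluation protocol just described can be sketched as below, for illustration only. The names k_fold_accuracy and one_nn_predict are hypothetical; the 1-nearest-neighbour rule is merely a stand-in classifier for the example (the experiments used Weka's SMO and k-NN), and the use of the sample standard deviation is an assumption.

```python
import statistics


def one_nn_predict(x_train, y_train, x_test):
    """Stand-in classifier for the example: 1-nearest-neighbour rule."""
    predictions = []
    for q in x_test:
        best = min(
            range(len(x_train)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x_train[j], q)),
        )
        predictions.append(y_train[best])
    return predictions


def k_fold_accuracy(instances, labels, train_and_predict, condense=None, folds=10):
    """Mean and standard deviation of the accuracy over contiguous folds."""
    n = len(instances)
    if n < folds:
        raise ValueError("need at least one instance per fold")
    accuracies = []
    for i in range(folds):
        start, stop = i * n // folds, (i + 1) * n // folds
        test_idx = list(range(start, stop))  # the ith block is the testing set
        train_idx = [j for j in range(n) if j < start or j >= stop]
        x_tr = [instances[j] for j in train_idx]
        y_tr = [labels[j] for j in train_idx]
        if condense is not None:
            # Condensing (or editing) sees the training part only.
            x_tr, y_tr = condense(x_tr, y_tr)
        predicted = train_and_predict(x_tr, y_tr, [instances[j] for j in test_idx])
        correct = sum(p == labels[j] for p, j in zip(predicted, test_idx))
        accuracies.append(correct / len(test_idx))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Applying the condensing step inside the loop, after the split, keeps the testing instances from influencing which training instances are retained.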

All seven datasets exhibit similar behaviour to that

depicted in Figure 1, with respect to how the

classification accuracy varies as a function of the %

of training data removed. The classification accuracy

results are not unanimous, but favour k-NN over

SMO, the latter having significantly better accuracy

than k-NN only for the Song data (Figure 2). For the

Letter Image, Wearable Computing, and MAGIC

Gamma datasets k-NN did significantly better

(example: Figure 1). Furthermore, for some of the

SpeedingupSupportVectorMachines-ProbabilisticversusNearestNeighbourMethodsforCondensingTrainingData

367

datasets such as the Spam, Wine, and Handwritten

Digits data there are no significant differences

between SMO and k-NN. For the Spambase data

SMO is significantly better only when more than

60% of the data are discarded (see Figure 3).

Figure 1: Accuracy and time vs % training data removed

by random sampling for Wearable Computing data.

Figure 2: Accuracy and time versus % of training data

removed by blind random sampling for Song data.

Figure 3: Accuracy and time versus % of training data

removed by blind random sampling for Spambase data.

With respect to the total time taken by SMO and k-

NN, in all the datasets, SMO takes considerably less

running time than k-NN, and all show behaviour

similar to the curves in Figures 1-3. For example, if

70% of the data are discarded then k-NN runs about

five times faster (and SMO about ten times faster)

than when all the data are used for training. This is

not too surprising since k-NN runs in O(N^2)

expected time and SMO is able to run faster in

practice depending on the structure of the data.

4.3 The Condensing Classifiers

Experiments were done applying various training

data condensation classifiers to reduce the size of

the training data that was fed to both the SMO and k-

NN classifiers. The condensing classifiers tried

were: (1) Gaussian condensation, (2) Wilson editing,

(3) Wilson editing+Gaussian condensation, (4)

Wilson editing+k-NN condensation, and (5) k-NN

condensation. Figures 4-10 show the percent mean

accuracy and mean total running times (in seconds)

for all the five condensation classifiers, plus the

results for blind random sampling obtained by

discarding 40% (rem40) and 70% (rem70) of the

training data. In all the figures ‘Con’ indicates

condensation, ‘Wilson’ denotes Wilson smoothing,

and ‘Gauss’ stands for Gaussian condensation.

Figure 4: Accuracy and total time of condensing algorithm

methods for the Spambase data.

Perusal of the figures reveals that none of the

condensing methods improves the accuracy of the

classifiers that do not use condensing. With some of

the datasets the accuracy remains unchanged, such

as for the Wearable Computing data in Figure 7, and

the Handwritten Digits in Figure 8. For other

datasets accuracy suffers considerably, as with the

Wine data in Figure 5, and the Song data in Figure 9.

Figure 5: Accuracy and total time of condensing algorithm

methods for the Wine data.

With k-NN classification, random removal of 70%

ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods

368

of the training set, and Gaussian condensation were

the fastest methods for all the datasets other than the

Handwritten Digits. Similar relative behavior was

observed with SMO. However, in all cases the

running times for condensing with SMO were much

smaller than those for condensing with k-NN.

Figure 6: Accuracy and total time of condensing algorithm

methods for the Letter Image data.

Figure 7: Accuracy and total time of condensing algorithm

methods for the Wearable Computing data.

Figure 8: Accuracy and total time of condensing algorithm

methods for the Handwritten Digits data.

Figure 9: Accuracy and total time of condensing algorithm

methods for the Song data.

Figure 10: Accuracy and total time of condensing

algorithm methods for the MAGIC Gamma data.

In Figures 4-10 the times plotted are total times:

condensing time + training time + testing time. One

of the main goals of this research is to compare the

testing times of the various classifiers, since these

reflect the speed of the classifier on all future data.

However, if the condensing times and training times

dominate the testing time, then the total times listed

in the figures may hide the testing times. Therefore a

breakdown of the individual times was also plotted

for all the experiments. Space does not permit

including all the figures, and therefore an example is

offered for the MAGIC Gamma data in Figure 11.

Note from Figure 10 that the classifiers with the

smallest total times are: SMO, SMO+rem40,

SMO+rem70, and SMO+Gauss. Furthermore their

accuracies are not significantly different. Therefore

the classifiers with the fastest testing times would be

preferred in this case.
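The breakdown of the total time into its three components can be measured along the following lines. This is an illustrative sketch, not the timing harness used in the experiments; the function timed_pipeline and its callable arguments (condense, train, test) are hypothetical placeholders for whichever condensing method and classifier are being measured.

```python
import time


def timed_pipeline(condense, train, test, x_train, y_train, x_test):
    """Run condensing, training, and testing, timing each phase separately."""
    t0 = time.perf_counter()
    x_small, y_small = condense(x_train, y_train)
    t1 = time.perf_counter()
    model = train(x_small, y_small)
    t2 = time.perf_counter()
    predictions = test(model, x_test)
    t3 = time.perf_counter()
    times = {
        "condensing": t1 - t0,
        "training": t2 - t1,
        "testing": t3 - t2,
    }
    # Total time = condensing time + training time + testing time.
    times["total"] = times["condensing"] + times["training"] + times["testing"]
    return predictions, times
```

Reporting the three phases separately makes it visible when a method with a small total time still has a comparatively large testing time, which is the component paid on all future data.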

Figure 11 shows the breakdown of condensing,

training, testing, and total time for the MAGIC

Gamma data on a linear scale in seconds. This figure

clearly shows how large the testing time for k-NN is

compared to all other classifiers, thus making it

difficult to compare the four classifiers of main

interest. To zoom in on their performance the data

from Figure 11 are shown on a logarithmic scale in

Figure 12, where it can be clearly seen that SMO

runs faster when pre-processed by rem40 or rem70,

but not when pre-processed by Gaussian

condensation. Similar behaviour is observed with the

other six datasets.

Figure 11: Breakdown of condensing, training, testing, and

total time for the MAGIC Gamma data.

Figure 12: The data of Figure 11 on a logarithmic scale.

5 CONCLUSIONS

One of the main conclusions that can be made from

the experiments reported here is that blind random

sampling is surprisingly good and robust. For all the

datasets, as much as 70% to 80% of the data may be

discarded, without incurring any significant decrease

in the classification accuracy. Furthermore, for six of

the seven datasets, discarding 70% of the data at

random in this way made k-NN run about five times

faster, and SMO about ten times faster. Since this

method is so simple and requires so little

computation time we believe that it should play a

role as a pre-processing step for speeding up SVMs.

Previous research has shown that SVMs perform

better than k-NN. However, some of the

comparisons have used synthetically generated data

that does not resemble real world data. On the other

hand, the results of the present study with seven

real-world datasets tell a different story. SMO is

significantly better only for the Song data, whereas

k-NN does better for the Letter Images, Wearable

Computing, and MAGIC Gamma datasets. For the

other three datasets (Spam, Wine, and Handwritten

Digits) there are no significant differences between

SMO and k-NN. In future research we hope to

discover structural features of the data that predict

when SMO is expected to outperform k-NN.

One of the goals of this research was to test

how much Wilson editing improves the accuracy of

classifiers in practice. It was found that, for all seven

datasets, using Wilson editing as a pre-processing

step to either SVM or k-NN yielded no

statistically significant improvement in accuracy.

Furthermore, except for the Wine and Song datasets

Wilson editing incurs a considerable additional cost

in the editing (condensing) time, although it can

speed up the training and testing times.

Another main goal of this research project was to

compare the new proposed method for condensing

training data in O(N) worst-case time: Gaussian

condensation. This probabilistic method falls in the

category of guided (or intelligent) random sampling

and is almost as fast as blind random sampling. The

results of this study show that Gaussian

condensation is competitive with 70% blind random

sampling, with respect to both accuracy and running

time, relative to the other methods tested. However,

the main overall conclusion of this study is that blind

random sampling is the best overall method for

speeding up support vector machines.

ACKNOWLEDGEMENTS

This research was supported by a grant from the

Provost's Office of New York University Abu Dhabi

in the United Arab Emirates. The authors are

grateful to the University of California at Irvine for

making available their large collection of data at the

Machine Learning Repository.

REFERENCES

Bache, K., Lichman, M., 2013. UCI Machine Learning

Repository [http://archive.ics.uci.edu/ml]

Bakir, G. H, Bottou, L., Weston, J., 2004. Breaking SVM

complexity with cross-training. In Advances in Neural

Information Processing Systems 17 (NIPS-2004), Dec.

13-18, 2004, Vancouver, Canada, pp. 81-88.

Bordes, A., Ertekin, S., Weston, J., Bottou, L., 2005. Fast

kernel classifiers with online and active learning. J. of

Machine Learning Research, vol. 6, pp. 1579-1619.

Almeida, M. B., Braga, A. P., Braga, J. P., 2000. SVM-

KM: speeding SVMs learning with a priori cluster

ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods

370

selection and k-means. In: Proc. of the 6th Brazilian

Symposium on Neural Networks, pp. 162–167.

Chen, J., Zhang, C., Xue, X., Liu, C.-H., 2013. Fast

instance selection for speeding up support vector

machines. Knowledge-Based Systems, vol. 45, pp. 1-7.

Chen, J., Liu, C.-L., 2011. Fast multi-class sample

reduction for speeding up support vector machines.

Proceedings of the IEEE International Workshop on

Machine Learning for Signal Processing, Beijing,

China, September 18-21.

Chen, J., Chen, C., 2002. Speeding up SVM decisions

based on mirror points. Proc. 6th International Conf.

Pattern Recognition, vol. 2, pp. 869-872.

Devroye, L., 1981. On the inequality of Cover and Hart in

nearest neighbour discrimination. IEEE Trans. Pattern

Analysis and Machine Intelligence, vol. 3, pp. 75–78.

Hart, P. E., 1968. The condensed nearest neighbour rule.

IEEE Trans. Infor. Theory, vol. 14, pp. 515–516.

Kawulok, M., Nalepa, J., 2012. Support vector machines

training data selection using a genetic algorithm. In

G.L. Gimel’farb et al. (Eds.): Structural, Syntactic,

and Statistical Pattern Recognition, LNCS 7626, pp.

557–565.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., Murthy,

K. R. K., 2001. Improvements to Platt’s SMO

algorithm for SVM classifier design. Neural

Computation, vol. 13, pp. 637-649.

Lee, Y. L., Mangasarian, O. L., 2001. RSVM: Reduced

support vector machines. In Proceedings of the First

SIAM International Conference on Data Mining,

SIAM, Chicago, April 5-7, (CD-ROM).

Li, X., Cervantes, J., Yu, W., 2012. Fast classification for

large datasets via random selection clustering and

Support Vector Machines. Intelligent Data Analysis,

vol. 16, pp. 897-914.

Liu, X., Beltran, J. F., Mohanchandra, N., Toussaint, G.

T., 2013. On speeding up support vector machines:

Proximity graphs versus random sampling for pre-

selection condensation. Proc. International Conf.

Computer Science and Mathematics, Dubai, United

Arab Emirates, Jan. 30-31, Vol. 73, pp. 1037-1044.

Ng, W. Q., Dash, M., 2006. An evaluation of progressive

sampling for imbalanced datasets. In Sixth IEEE

International Conference on Data Mining Workshops,

Hong Kong, China. 2006.

Panda, N., Chang, E. Y., Wu, G., 2006. Concept boundary

detection for speeding up SVMs. Proc. 23rd

International Conf. on Machine Learning, Pittsburgh.

Platt, J. C., 1998. Fast training of support vector machines

using sequential minimal optimization. In Advances

in Kernel Methods: Support Vector Machines, B.

Schölkopf, C. Burges, and A. Smola, Eds., MIT Press.

Portet, F., Gao, F., Hunter, J., Quiniou, R., 2007.

Reduction of large training set by guided progressive

sampling: Application to neonatal intensive care data.

Proc. of Intelligent Data Analysis in Biomedicine and

Pharmacology, Amsterdam, pp. 43-44.

Provost, F., Jensen, D., Oates, T., 1999. Efficient

progressive sampling. In Fifth ACM SIGKDD

International Conference on Knowledge Discovery

and Data Mining, San Diego, USA, 1999.

Sriperumbudur, B. K., Lanckriet, G., 2007. Nearest

neighbour prototyping for sparse and scalable support

vector machines. Technical Report No. CAL-2007-02,

University of California San Diego.

Toussaint, G. T., Berzan, C., 2012. Proximity-graph

instance-based learning, support vector machines, and

high dimensionality: An empirical comparison.

Proceedings of the Eighth International Conference

on Machine Learning and Data Mining, July 16-19,

2012, Berlin, Germany. P. Perner (Ed.): LNAI 7376,

pp. 222–236, Springer-Verlag Berlin Heidelberg.

Toussaint, G. T., 2005. Geometric proximity graphs for

improving nearest neighbour methods in instance-

based learning and data mining. International J.

Computational Geometry and Applications, vol. 15,

April, pp. 101-150.

Toussaint, G. T., 1974. Bibliography on estimation of

misclassification. IEEE Transactions on Information

Theory, vol. 20, pp. 472-479.

Vapnik, V., 1995. The Nature of Statistical Learning

Theory, Springer-Verlag, New York, NY.

Wang, Y., Zhou, C. G., Huang, Y. X., Liang, Y. C., Yang,

X. W., 2006. A boundary method to speed up training

support vector machines. In: G. R. Liu et al. (eds),

Computational Methods, Springer, Printed in the

Netherlands, pp. 1209–1213.

Wilson, D. L., 1973. Asymptotic properties of nearest

neighbour rules using edited-data. IEEE Trans.

Systems, Man, and Cybernetics, vol. 2, pp. 408–421.

Witten, I., Frank, E., 2000. WEKA: Machine Learning

Algorithms in Java. In Data Mining: Practical

Machine Learning Tools and Techniques with Java

Implementations, Morgan Kaufmann, pp. 265-320.

SpeedingupSupportVectorMachines-ProbabilisticversusNearestNeighbourMethodsforCondensingTrainingData

371