SurfOpt: A New Surface Method for Optimizing the Classification of
Imbalanced Dataset
André Rodrigo da Silva, Leonardo M. Rodrigues and Luciana de Oliveira Rech
Department of Informatics and Statistics (INE), Federal University of Santa Catarina (UFSC)
Rua Delfino Conti, s/n, Trindade, Cx.P. 476, CEP: 88040-900, Florianópolis, Brazil
Keywords:
Machine Learning, Skewed Classes, Imbalanced Datasets, Binary Classification.
Abstract:
Imbalanced classes constitute very complex machine learning classification problems, particularly when there are few examples for training, in which case most algorithms fail to learn discriminant characteristics and tend to completely ignore the minority class in favour of the model's overall accuracy. Datasets with imbalanced classes are common in several machine learning applications, such as sales forecasting and fraud detection. Current strategies for dealing with imbalanced classes rely on manipulating the datasets as a means of improving classification performance. Instead of optimizing classification boundaries based on some measure of distance to points, this work directly optimizes the decision surface, essentially turning a classification problem into a regression problem. We demonstrate that our approach is competitive with other classification algorithms for imbalanced classes, while exhibiting distinct properties.
1 INTRODUCTION
In many scenarios of practical application of machine learning, such as sales forecasting (Syam and Sharma, 2018), epidemic prevention (Guo et al., 2017), fraud detection (Carneiro et al., 2017), and disease evolution (Zhao et al., 2017), the subpopulation (or class of interest) may consist of a negligible portion of the observed events, as in the analogy "needle in a haystack", but even harder: the events observed as data points may be indistinguishable from one another for prediction purposes. For example, the next customer to close a deal with an online business may have many, if not all, the characteristics of a customer who has not closed a deal in the past.
As it is not always feasible to sample all the relevant characteristics of a population, discriminating characteristics are commonly neglected. This degrades the classifier's performance, because the data points begin to appear closer and more similar, with overlapping class distributions, which makes decision boundaries subject to uncertainty. This work addresses typical problems in this type of scenario.
Although binary classification problems are not inherently hard, characteristics of the data points and intricacies of the problem can add further complexity. Typical examples of such characteristics are severely skewed classes, classes that are not linearly separable, feature sets mixing categorical and continuous variables, and observations of a sparse nature. Under skewness, even algorithms that produce complex non-linear classification models will tend to stop classifying examples as belonging to the minority class altogether, in order to minimize classification error.
The accuracy metric is the most popular measure of a classifier's performance, since accuracy suffices as a metric for classification with balanced classes. For datasets with imbalanced classes, accuracy is misleading, as it is biased towards the majority class and insensitive to changes in the quality of classification of the minority class. Sales prediction is one scenario where accuracy's limitations come into play. In this case, it is essential to know how many times buying customers (i.e. the minority class) were correctly predicted and to quantify what fraction of all buying-customer events those predictions correspond to. A model can be right every time it predicts a buying customer, but if its predictions of non-buying customers are not also perfect, it can still be misclassifying most of the buying customers.
Cross-validation of binary classifiers yields four primary counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). From these counts, it becomes possible to compose
more advanced metrics, such as specificity, also called the True Negative Rate (TNR), which measures the rate of negative examples predicted correctly over the total number of negative cases. Another useful metric is the Negative Predictive Value (NPV), which is the ratio between the number of correctly predicted negative cases and the total number of predicted negative cases. When dealing with skewed data, the accuracy of a classifier tends to be very stable and unable to reflect changes in specificity or negative predictive value (assuming the minority class as negative). Recent years have seen growing interest in better overall metrics that can reflect changes in the prediction of the minority class, such as the F-measure (F1) (Lipton et al., 2014) and the Matthews Correlation Coefficient (MCC) (Boughorbel et al., 2017).
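To make these definitions concrete, the following minimal Python sketch (ours, not from the paper) computes the metrics directly from the four counts; the example counts are hypothetical:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def tnr(tn, fp):
    """Specificity / True Negative Rate: correct negatives over all actual negatives."""
    return tn / (tn + fp)

def npv(tn, fn):
    """Negative Predictive Value: correct negatives over all predicted negatives."""
    return tn / (tn + fn)

# Hypothetical counts with the minority class taken as negative:
tp, tn, fp, fn = 970, 5, 25, 0
print(accuracy(tp, tn, fp, fn))  # 0.975 -- looks excellent
print(tnr(tn, fp))               # ~0.17 -- most minority cases are missed
print(npv(tn, fn))               # 1.0  -- yet negative predictions are "safe"
```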
This paper presents an algorithm that handles classification tasks with data imbalance and evaluates its classification behaviour. The main contributions of this work are the following:
The novel SurfOpt algorithm, which is able to handle data imbalance properly;
A parameter-based strategy that makes it possible to directly optimize the classification surface;
A three-metric diagnostic approach for performance issues in classifiers.
The next sections of this paper are organised as follows. Section 2 presents the related work. Section 3 includes essential concepts for the development of this work. Section 4 introduces the SurfOpt algorithm, which is the main contribution of this work. Section 5 presents an evaluation of the proposed algorithm. Section 6 suggests future work, and Section 7 concludes the paper.
2 RELATED WORK
This section presents the work related to this research,
including those considered essential in the areas of
machine learning classification, support vector ma-
chines, and boosting-and-bagging algorithms.
2.1 Machine Learning Classification
A classification task is a problem where a population (e.g. flowers) needs to be discriminated into its different classes (e.g. iris setosa, versicolor, virginica), whereas automated classification consists of learning a function that maps input features (i.e. individual characteristics) to outputs (i.e. class labels).
2.2 Support Vector Machines (SVM)
Linearly separable classes can be easily classified by a model generated by the SVM algorithm, which works by finding an (n − 1)-dimensional hyperplane that separates the classes in an n-dimensional space, maximising the margin between the hyperplane and each set of points. The underlying optimization problem is convex, meaning that the algorithm always finds the optimal solution. Classes that are not linearly separable can also be efficiently classified with SVM by using a kernel. The kernel trick projects points into a higher-dimensional space where they become linearly separable (Boser et al., 1992; Hofmann, 2006). When used in a two-dimensional space, a linear SVM is equivalent to dividing the points with a straight line.
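As an illustration of the kernel trick, consider a class arrangement no straight line can separate. This is our sketch in Python with scikit-learn, not the e1071/R setup used later in Section 5, and the data is synthetic:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
# Inner disc (class 1) surrounded by a ring (class 0): not linearly separable.
y = (np.linalg.norm(X, axis=1) < 1.0).astype(int)

linear = SVC(kernel="linear").fit(X, y)       # straight-line boundary
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)  # kernel trick: implicit projection

print(linear.score(X, y))  # little better than guessing the majority class
print(rbf.score(X, y))     # substantially higher on the same data
```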
2.3 Boosting and Bagging
Adaboost and Adabag are ensemble methods that work by iteratively improving classification results with weak classifiers. The ensemble model produced by Adaboost decides the class of each observation through a weighted vote, where the weight of each classifier's vote is learned during training. Adabag uses a simple majority-vote strategy, and the classifiers added to the ensemble at each iteration are independent of the previous ones, unlike Adaboost, which derives classifiers from previous iterations (Alfaro et al., 2013; Schapire, 2013).
2.4 Oversampling and Undersampling
Recent work on imbalanced dataset classification uses non-algorithmic solutions, such as oversampling, undersampling, or a combination of both strategies (Haixiang et al., 2017). Oversampling approaches artificially increase the number of data points of the minority class, while undersampling techniques remove data points of the majority class to balance the dataset. These techniques manipulate the dataset as a means of improving classifier performance.
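For concreteness, a minimal Python sketch of the random variants of both strategies (the function names are ours; library implementations such as SMOTE add more sophistication):

```python
import numpy as np

def random_undersample(X, y, majority=0, seed=0):
    """Drop majority-class rows at random until classes are balanced.
    Assumes the majority class really has more rows than the minority."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y != majority)
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([keep, min_idx])
    return X[idx], y[idx]

def random_oversample(X, y, minority=1, seed=0):
    """Duplicate minority-class rows at random until classes are balanced."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y != minority)
    min_idx = np.flatnonzero(y == minority)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    idx = np.concatenate([maj_idx, min_idx, extra])
    return X[idx], y[idx]
```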
2.5 Surface Optimization
Algorithms for classification generally optimize a classification boundary based on some distance-to-point metric (i.e. large-margin classifiers). These algorithms have shown satisfactory convergence properties and great accuracy in many practical applications. The weakness of this kind of algorithm lies in its sensitivity to outliers and in the fact that it optimizes for individual points rather than for the entire set, making classifiers subject to bias found in the data.
SurfOpt optimizes the decision surface directly, being able to learn trade-off regions for classification and remaining resilient to outliers, as the influence of individual data points diminishes as the number of data points grows. We show that even maximum-margin classifiers with polynomial kernels, such as SVM, are unable to capture trade-off regions on imbalanced data.
3 BACKGROUND
In this section, we briefly review two concepts that are important to the formulation and evaluation of the SurfOpt algorithm.
3.1 Matthews Correlation Coefficient
Equation (1) presents the Matthews Correlation Coefficient (MCC), which can be undefined if any of the counts TP, TN, FP, FN is equal to zero. Adding an infinitesimal positive constant to the denominator of the formula solves that problem. The MCC takes values in the interval [−1, 1]: an MCC of 1 indicates a perfectly correct classifier, while an MCC of −1 indicates a perfectly incorrect one.
\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{1}
\]
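A minimal sketch of Equation (1) in Python with the infinitesimal constant mentioned above (the eps value is our choice):

```python
import math

def mcc(tp, tn, fp, fn, eps=1e-12):
    """Matthews Correlation Coefficient, Equation (1); eps keeps the
    value defined when any count is zero."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps
    return num / den

print(mcc(50, 45, 5, 0))   # strongly positive: near-perfect classifier
print(mcc(0, 5, 45, 50))   # strongly negative: near-perfectly inverted one
```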
3.2 Convex Hull
The convex hull of a set of points is the smallest con-
vex set that contains the points (Barber et al., 1996).
The points on a convex hull define the vertices of a
polygon that contains the entire set. Some implementations of the convex hull run in linear expected time (Devroye and Toussaint, 1981). In this work, we show how to avoid computing the convex hull of sets and distance-to-point calculations by optimising a curve equation and a transformation of the data points.
4 SurfOpt: OPTIMISING
CLASSIFICATION SURFACE
When classifying data points of populations that overlap one another, a non-smooth threshold for the decision boundary, such as a single straight line, will fail to capture the region of best classification trade-off. To best capture the trade-off region, we envisioned a weak classifier with a non-linear decision boundary. Our classifier boundary is a curve, defined by the function below:
\[
f(a) = a^2 \exp(\omega) \tag{2}
\]
The curve defined by Equation (2) is convenient as it never assumes negative values. For large positive values of ω it approximates a vertical line, and for very negative values of ω it approximates a horizontal line. From Equation (2), we learn the parameter ω that defines the width of the arc. To classify data points outside of the curve, it is necessary to apply a kernel transformation; thus, concomitantly, we learn a transformation of the points. The parameter θ defines the angle of rotation applied to every point P(a, b) in relation to the origin P(0, 0).
\[ b = f(a) \tag{3} \]
\[ a' = a\cos(\theta) - b\sin(\theta) \tag{4} \]
\[ b' = a\sin(\theta) + b\cos(\theta) \tag{5} \]
In addition to the rotation transformation, we add the parameters c and d to shift the a and b coordinates of every point in relation to the origin P(0, 0), as depicted in Figure 1.
Figure 1: Curve boundary and transformations to the points (the width ω, the rotation θ, and the offsets c and d, relative to the origin (0, 0)).
\[ k_a(a) = a\cos(\theta) - b\sin(\theta) + c \tag{6} \]
\[ k_b(b) = a\sin(\theta) + b\cos(\theta) + d \tag{7} \]
Equations (6) and (7) transform the x-axis and the
y-axis of the points on the plane, respectively. Once
a curve decision boundary is set and the points trans-
formed, we then use the Heaviside step function as
the decision function of our algorithm.
\[ g(a) = k_b(a) - f(a) \tag{8} \]
\[
H(a) =
\begin{cases}
0, & \text{if } g(a) < 0 \\
1, & \text{if } g(a) \geq 0
\end{cases} \tag{9}
\]
First, Equation (8) determines whether a point is below or above (i.e. inside) the curve after the transformation. Then, Equation (9) labels the point as belonging to the minority class, denoted by 1 (one), if it is inside, or to the majority class, denoted by 0 (zero), otherwise.
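A Python sketch of the full decision function, under our reading that the curve of Equation (2) is evaluated at the transformed x-coordinate (the paper leaves this composition implicit):

```python
import numpy as np

def heaviside_label(a, b, omega, theta, c, d):
    """Label one point P(a, b) with the curve decision boundary,
    a sketch of Equations (2)-(9)."""
    # Equations (6) and (7): rotate the point by theta, then shift by (c, d).
    a_t = a * np.cos(theta) - b * np.sin(theta) + c
    b_t = a * np.sin(theta) + b * np.cos(theta) + d
    # Equation (2): curve height at the transformed x-coordinate.
    f = a_t ** 2 * np.exp(omega)
    # Equations (8) and (9): Heaviside step on g = k_b(a) - f(a);
    # 1 = minority class (inside the arc), 0 = majority class.
    g = b_t - f
    return 1 if g >= 0 else 0

print(heaviside_label(0.1, 0.5, omega=0.0, theta=0.0, c=0.0, d=0.0))  # 1
```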
The area defined by the curve within the range of the data points is an ellipse segment; every point inside it is predicted to belong to the minority class. The algorithm evaluates the classifier's performance and yields a value x ∈ [0, 1] for classifier accuracy. In order to maximise x, we use the logistic loss, which can be defined as follows:
\[
\mathrm{Cost}(x, y) =
\begin{cases}
-\log(h_\theta(x)), & \text{if } y = 1 \\
-\log(1 - h_\theta(x)), & \text{if } y = 0
\end{cases} \tag{10}
\]
In Equation (10), \(h_\theta(x)\) denotes the hypothesis function with parameters θ and input x; our hypothesis functions are Equations (6) and (7). We assume y to always equal 1, since that is the maximum value of the accuracy metric (a perfect classifier), which is what we want to achieve. With Equation (10), we are able to compute the gradient descent and penalise the parameters of our hypothesis function at every iteration.
We calculate the gradient of the parameters with the sigmoid function, i.e. \(f(x) = \frac{1}{1 + e^{-x}}\), to minimize the error. We experimented with different gradient descent algorithms, such as Adam (Kingma and Ba, 2015) and Adagrad (Duchi et al., 2011); AMSGrad (Reddi et al., 2018) showed the best results. AMSGrad then updates the transformation parameters, i.e. Equations (6) and (7). Usually, a gradient descent algorithm would repeat the optimization steps until the maximum number of iterations or the target performance is reached. In our algorithm, we use a restart routine whenever the TNR and TPR metrics fall below a threshold, as a way to avoid over-fitting to the accuracy metric. When the algorithm stops, we select the curve which resulted in the best MCC on the training set.
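For reference, a sketch of a single AMSGrad parameter update (Reddi et al., 2018); the lr/beta defaults mirror the α, β, γ values reported in Section 5.3, while eps and the vectorised form are our choices:

```python
import numpy as np

def amsgrad_step(param, grad, m, v, v_hat,
                 lr=0.9, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update on a parameter vector."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    v_hat = np.maximum(v_hat, v)                # never shrinks, unlike Adam
    param = param - lr * m / (np.sqrt(v_hat) + eps)
    return param, m, v, v_hat
```

Keeping the running maximum of the second moment is what distinguishes AMSGrad from Adam and fixes Adam's convergence issue on some problems.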
4.1 SurfOpt: Algorithm
This section presents the logic of the SurfOpt approach. Algorithm 1 describes the procedures performed for optimising the classification surface.
Algorithm 1: SurfOpt.
S: data points;
C_i: points on the curve;
E_i: evaluation of iteration i;
L: labels of the points;
c: x-offset of the curve origin;
d: y-offset of the curve origin;
θ: rotation angle;
ω: width parameter, from Equation (2);
N: true negative rate threshold;
P: true positive rate threshold;
M: maximum iterations for the gradient descent;
R: resolution, the number of points sampled on the curve;
α: learning rate;

function SAMPLECURVEPOINTS(S, R, c, d, θ, ω)
    Sample R points from a parabola within some range beyond the data points, transformed by the parameters c, d, θ, ω with Equations (6) and (7).
    return C
end function

procedure RESTARTCURVEPOSITION(c, d, θ, ω)
    set c (x-offset) to a random value in [−1, 1] times the sample standard deviation on the x-axis;
    set d (y-offset) to a random value in [−1, 1] times the sample standard deviation on the y-axis;
    set θ (rotation) to a random value in [2, 4] × π;
    reset ω to its original value;
end procedure

function CLASSIFIEREVALUATION(S, C_i)
    Transform the data points: every point inside the arc of the curve defined by Equation (2) is predicted to be of the minority class; otherwise, the point is predicted to belong to the majority class.
    return E_i
end function

function LEARNCURVE(S, L, N, P, M, ω, R)
    ω_0 = ω
    RESTARTCURVEPOSITION(c, d, θ, ω)
    for i in 1 to M do
        C_i = SAMPLECURVEPOINTS(S, R, c, d, θ, ω)
        E_i = CLASSIFIEREVALUATION(S, C_i)
        Calculate the error e with Equation (10)
        Save the curve and its evaluation
        if TNR < N or TPR < P then
            RESTARTCURVEPOSITION(c, d, θ, ω_0)
        else
            for j in 1 to |C| do
                Δc += ∂[sigmoid(x′_j)]/∂c
                Δd += ∂[sigmoid(y′_j)]/∂d
                Δω += ∂[sigmoid(x′_j)]/∂ω + ∂[sigmoid(y′_j)]/∂ω
                Δθ += ∂[sigmoid(x′_j)]/∂θ + ∂[sigmoid(y′_j)]/∂θ
            end for
            grad_c = e × Δc
            grad_d = e × Δd
            grad_ω = e × Δω
            grad_θ = e × Δθ
            Update c, d, ω, θ with the gradients and the learning rate α
        end if
    end for
end function
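Putting the pieces together, the following condensed Python sketch mirrors the structure of LEARNCURVE. To stay short it substitutes a finite-difference gradient on the MCC for the analytic sigmoid derivatives of Algorithm 1, so it approximates the method rather than porting it faithfully; the step size and learning rate are our choices:

```python
import numpy as np

def learn_curve(X, y, n_thresh=0.35, p_thresh=0.10, max_iter=200,
                lr=0.01, seed=0):
    """Condensed SurfOpt training loop: restart-on-degenerate-rates,
    gradient steps otherwise, keep the best-MCC curve."""
    rng = np.random.default_rng(seed)

    def restart():
        # Random (c, d) within the per-axis standard deviations,
        # theta in [2, 4] * pi (covers a full turn), omega reset to 0.
        return np.array([rng.uniform(-1, 1) * X[:, 0].std(),
                         rng.uniform(-1, 1) * X[:, 1].std(),
                         rng.uniform(2, 4) * np.pi, 0.0])

    def predict(p):
        c, d, th, om = p
        a, b = X[:, 0], X[:, 1]
        at = a * np.cos(th) - b * np.sin(th) + c   # Equation (6)
        bt = a * np.sin(th) + b * np.cos(th) + d   # Equation (7)
        return (bt - at ** 2 * np.exp(om) >= 0).astype(int)  # Eqs (8)-(9)

    def scores(p):
        yp = predict(p)  # class 1 (the Heaviside minority label) as positive
        tp = int(np.sum((yp == 1) & (y == 1)))
        tn = int(np.sum((yp == 0) & (y == 0)))
        fp = int(np.sum((yp == 1) & (y == 0)))
        fn = int(np.sum((yp == 0) & (y == 1)))
        den = np.sqrt(float((tp+fp) * (tp+fn) * (tn+fp) * (tn+fn))) + 1e-12
        return ((tp * tn - fp * fn) / den,
                tn / max(tn + fp, 1), tp / max(tp + fn, 1))

    params = restart()
    best, best_mcc = params.copy(), -np.inf
    for _ in range(max_iter):
        mcc, tnr, tpr = scores(params)
        if mcc > best_mcc:
            best, best_mcc = params.copy(), mcc
        if tnr < n_thresh or tpr < p_thresh:
            params = restart()   # the restart routine of Algorithm 1
            continue
        grad = np.zeros(4)       # finite-difference gradient ascent on MCC
        for k in range(4):
            step = np.zeros(4)
            step[k] = 1e-3
            grad[k] = (scores(params + step)[0] - mcc) / 1e-3
        params = params + lr * grad
    return best, best_mcc
```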
5 EVALUATION
This section presents the evaluation of the SurfOpt algorithm, including the metrics analyzed, the experiments executed, and the results obtained.
5.1 Skewed Dataset Classification
One of the pervasive challenges of imbalanced classification is evaluating whether models produce reliable predictions, and classification tasks under these circumstances have at least three common pitfalls:
I True Positive Rate (TPR) Detrimental: the classifier inflates the number of misclassifications of the majority class, raising the number of correct predictions of the negative class. This creates more false negatives than true negatives, but some metrics perceive it as a positive effect because they treat TPR and TNR as equally important;
II Negative Predictive Value (NPV) Detrimental: the NPV is the rate of correctly predicted negative events over all negative predictions. When classes are severely skewed, the overall metrics can show reasonable values: the true negative rate may be high, meaning most negative examples in the dataset are correctly classified, yet the classifier may be wrong most of the times it predicts the negative class, with only a minor effect on the overall metric (i.e. accuracy), which ignores class imbalance;
III True Negative Rate (TNR) Detrimental, or the safe bet: this is the inverse of the NPV-detrimental situation, where the classifier may even display a perfect negative predictive value, always being right when it predicts the minority class. Overall metrics may still point to a useful classifier, but it seldom predicts the minority class, leading to a very low true negative rate.
To diagnose such problems in imbalanced classifiers, we propose the joint observation of a set of three metrics: accuracy, negative predictive value, and true negative rate. These issues are illustrated in Figure 2.
Figure 2: Challenges of imbalanced classification. (Three panels of ACC, NPV and TNR, one for each pitfall: TPR Detrimental, NPV Detrimental, and TNR Detrimental.)
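A rough Python sketch of this joint reading; the cut-off values and the ordering of the checks are illustrative assumptions, not taken from the paper:

```python
def diagnose(acc, npv, tnr, low=0.20, acc_ok=0.90):
    """Joint reading of the three proposed metrics, with the minority
    class taken as negative (as in Section 1)."""
    if acc < acc_ok:
        return "TPR detrimental: overall accuracy traded for the minority class"
    if tnr < low:
        return "TNR detrimental (safe bet): the minority class is rarely predicted"
    if npv < low:
        return "NPV detrimental: most negative predictions are wrong"
    return "no pitfall flagged at these thresholds"

# Applied to the mean results later reported in Table 1:
print(diagnose(0.975, 0.398, 0.007))  # SVM     -> TNR detrimental
print(diagnose(0.885, 0.110, 0.384))  # SurfOpt -> TPR detrimental
```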
5.2 Experiments
We used random sampling to draw 5,000 data points in Euclidean space from two normal distributions (A, B) with unitary standard deviations. The samples of distribution A constitute the majority class, which contains most of the data points (97.5%); the samples of distribution B belong to the minority class (2.5%). The distributions share all characteristics except the mean values of the x and y coordinates: the barycentres of the two classes are approximately a unit distance apart. This means that the classes overlap in space, as seen in real datasets. We repeated this procedure 500 times to create 500 synthetic datasets. For cross-validation, every dataset was split into a training (70%) and a validation (30%) set.
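This setup can be reproduced with a short Python sketch; the exact barycentre placement and seeding are our assumptions, and the paper's original code is in R:

```python
import numpy as np

def make_dataset(n=5000, minority_frac=0.025, dist=1.0, seed=0):
    """One synthetic dataset: two unit-variance normal populations
    roughly one unit apart, 97.5% majority / 2.5% minority,
    split 70/30 into training and validation."""
    rng = np.random.default_rng(seed)
    n_min = int(n * minority_frac)
    n_maj = n - n_min
    X = np.vstack([
        rng.normal(loc=(0.0, 0.0), scale=1.0, size=(n_maj, 2)),  # majority
        rng.normal(loc=(dist, 0.0), scale=1.0, size=(n_min, 2)),  # minority
    ])
    y = np.concatenate([np.zeros(n_maj), np.ones(n_min)])
    order = rng.permutation(n)
    cut = int(0.7 * n)  # 70% training / 30% validation
    train, valid = order[:cut], order[cut:]
    return X[train], y[train], X[valid], y[valid]

datasets = [make_dataset(seed=s) for s in range(500)]  # the 500 repetitions
```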
Figure 3: Two populations of data points overlapping one another, with an approximately unitary distance between their barycenters; 97.5% of the points belong to the majority class and 2.5% to the minority class. A curve decision surface learned with the SurfOpt algorithm separates these classes.
5.3 Results
Four classification algorithms were run. The SurfOpt algorithm was programmed in the R programming language. For the SVM algorithm, we used the e1071 package (Dimitriadou et al., 2006). For Adabag and Adaboost, we used the adabag package (Alfaro et al., 2013). All parameters were found with grid search. For SurfOpt, we used ω = 0, TNR threshold = 0.35, TPR threshold = 0.10, max iterations = 2000, and resolution = 20. For the AMSGrad optimization algorithm, we used α = 0.9, β = 0.9, γ = 0.999. For the SVM, we used a radial kernel with cost = 1000 and γ = 0.1. For the experiments with Adaboost, we used mfinal = 1 and the Breiman learning coefficient. For Adabag, we used mfinal = 1 and max depth = 70.
Table 1: Mean results for each algorithm.
Algorithm ACC NPV TNR MCC
SurfOpt 88.5% 11% 38.4% 0.156
SVM 97.5% 39.8% 0.7% 0.0409
Adaboost 97.3% 34.5% 7.4% 0.148
Adabag 97.3% 32.8% 8.5% 0.154
Table 1 shows the mean results over the 500 synthetic datasets. Note that the SurfOpt algorithm attains performance comparable to the ensemble methods Adabag and Adaboost in terms of mean Matthews Correlation Coefficient (MCC). The Support Vector Machine shows little classification value in terms of mean MCC. The low mean MCC of the SVM models can be explained by the fact that SVM is unable to deal with trade-off regions even with the radial kernel, as it optimizes for distance to points. As expected, the skewness made the SVM models biased towards the class with the most points. Although the SurfOpt algorithm shows a mean MCC similar to the Adaboost and Adabag ensemble methods, its mean True Negative Rate (TNR) is higher, meaning that it correctly predicted a higher ratio of the negative-class data points. By jointly observing these three metrics, the SurfOpt algorithm can be identified as a True Positive Rate (TPR) detrimental algorithm, as it trades overall accuracy for a better classification of the minority class. The SVM, Adaboost, and Adabag classifiers are True Negative Rate (TNR) detrimental, as they will only predict the minority class when it is a very safe bet.
6 FUTURE WORK
We expect to investigate the properties of an ensemble approach to the SurfOpt algorithm, using sets of non-uniform curves with complementary optimization criteria to compensate for individual classifiers' disadvantages. Other studies may also compare SurfOpt's performance against oversampling and undersampling approaches, explore the classifier's properties with noisy data, and find a generalization to n-dimensional spaces.
7 CONCLUSION
In this paper, we have introduced the SurfOpt algorithm. It brings together classification performance with the benefits of optimizing a classification surface. As opposed to common algorithms, the SurfOpt algorithm can better classify the minority class of a binary set even under severe skewness. We also devised a diagnostic for classification performance on imbalanced data, based on three basic metrics, as an effort to study classification under skewness. The source code of this work is being made available online (Da Silva, 2018).
ACKNOWLEDGMENT
The authors would like to thank CAPES/Brazil and CNPq/Brazil for the financial support, which was offered through the project Abys (401364/2014-3).
REFERENCES
Alfaro, E., Gamez, M., Garcia, N., et al. (2013). Adabag:
An R Package For Classification with Boosting and
Bagging. Journal of Statistical Software, 54(2):1–35.
Barber, C. B., Dobkin, D. P., and Huhdanpaa, H. (1996).
The Quickhull Algorithm For Convex Hulls. ACM
Transactions on Mathematical Software (TOMS),
22(4):469–483.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A
Training Algorithm for Optimal Margin Classifiers. In
Proceedings of the 5th annual workshop on Computa-
tional learning theory, pages 144–152. ACM.
Boughorbel, S., Jarray, F., and El-Anbari, M. (2017).
Optimal Classifier for Imbalanced Data Using
Matthews Correlation Coefficient Metric. PloS one,
12(6):e0177678.
Carneiro, N., Figueira, G., and Costa, M. (2017). A Data
Mining Based System For Credit-Card Fraud Detec-
tion In E-tail. Decision Support Systems, 95:91–101.
Da Silva, A. R. (2018). Surfopt.
https://github.com/andreblumenau/SurfOpt.
Devroye, L. and Toussaint, G. T. (1981). A Note on Linear
Expected Time Algorithms for Finding Convex Hulls.
Computing, 26(4):361–366.
Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D.,
Weingessel, A., and Leisch, M. F. (2006). The e1071
Package. Misc Functions of Department of Statistics
(e1071), TU Wien.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive sub-
gradient methods for online learning and stochastic
optimization. Journal of Machine Learning Research,
12(Jul):2121–2159.
Guo, P., Liu, T., Zhang, Q., Wang, L., Xiao, J., Zhang, Q.,
Luo, G., Li, Z., He, J., Zhang, Y., et al. (2017). Devel-
oping a Dengue Forecast Model using Machine Learn-
ing: A Case Study in China. PLoS neglected tropical
diseases, 11(10):e0005973.
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue,
H., and Bing, G. (2017). Learning from Class-
imbalanced Data: Review of Methods and Applica-
tions. Expert Systems with Applications, 73:220–239.
Hofmann, M. (2006). Support Vector Machines—Kernels
and the Kernel Trick. Notes, 26.
Kingma, D. P. and Ba, J. (2015). Adam: A Method For
Stochastic Optimization. 3rd International Confer-
ence for Learning Representations.
Lipton, Z. C., Elkan, C., and Naryanaswamy, B. (2014).
Optimal Thresholding of Classifiers To Maximize F1
Measure. In Joint European Conference on Machine
Learning and Knowledge Discovery in Databases,
pages 225–239. Springer.
Reddi, S. J., Kale, S., and Kumar, S. (2018). On the Conver-
gence of Adam and Beyond. 6th International Con-
ference on Learning Representations (ICLR).
Schapire, R. E. (2013). Explaining Adaboost. In Empirical
inference, pages 37–52. Springer.
Syam, N. and Sharma, A. (2018). Waiting for a Sales Re-
naissance in the Fourth Industrial Revolution: Ma-
chine Learning and Artificial Intelligence in Sales Re-
search and Practice. Industrial Marketing Manage-
ment, 69:135–146.
Zhao, Y., Healy, B. C., Rotstein, D., Guttmann, C. R., Bak-
shi, R., Weiner, H. L., Brodley, C. E., and Chitnis, T.
(2017). Exploration of Machine Learning Techniques
In Predicting Multiple Sclerosis Disease Course. PloS
one, 12(4):e0174866.