There are many methods for optimizing over hyperparameter settings, ranging from simple procedures such as grid or random search (Bergstra et al., 2011; Bergstra and Bengio, 2012) to more sophisticated model-based approaches using random forests (Hutter et al., 2011) or Gaussian processes (Snoek et al., 2012).
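To make the contrast concrete, the following sketch runs both of the simpler strategies with scikit-learn; the classifier, parameter ranges, and synthetic data are illustrative assumptions rather than the setup used in this work.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for a vectorized text corpus.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Grid search exhaustively evaluates every point on a fixed grid.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1, 10]}, cv=3).fit(X, y)

# Random search draws a fixed budget of configurations from a
# distribution, often matching grid search at a fraction of the cost
# (Bergstra and Bengio, 2012).
rand = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                          {"C": loguniform(1e-3, 1e2)},
                          n_iter=10, cv=3, random_state=0).fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
print(rand.best_params_, round(rand.best_score_, 3))
```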
Hence, in the context of the text classification problem, we can formulate a feature space composed of: a) the data set, b) the hyperparameters of the classification algorithms, and c) the textual pre-processing operators. According to Garey and Johnson (1990), the task of selecting the right features is nontrivial, and checking all possible combinations is an NP-complete task. Selecting the classifier algorithm becomes resource-intensive for an ensemble of classifiers, and it requires expert knowledge from various areas of Computer Science (Machine Learning, Natural Language Processing, High Performance Computing).
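The combinatorial growth of this feature space can be illustrated with a toy enumeration; all concrete data sets, operators, and hyperparameter values below are hypothetical placeholders.

```python
from itertools import product

datasets = ["news-corpus-A", "news-corpus-B"]
preprocessing = ["tokenize", "tokenize+stem", "tokenize+stem+stopwords"]
classifiers = {
    "naive_bayes": [{"alpha": a} for a in (0.1, 0.5, 1.0)],
    "svm": [{"C": c} for c in (0.1, 1.0, 10.0)],
}

# Each point in the space is a (data set, pre-processing,
# classifier, hyperparameters) combination.
space = [
    (d, p, clf, params)
    for d, p in product(datasets, preprocessing)
    for clf, grid in classifiers.items()
    for params in grid
]
# Even this toy space has 2 * 3 * (3 + 3) = 36 points; realistic
# grids grow multiplicatively and quickly become intractable.
print(len(space))  # 36
```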
In this research, we propose a methodology for dealing with this complex task using an optimization model that explores the feature space with the goal of maximizing the accuracy of the ensemble of classifiers. We outline the optimization model, describe each of its steps, and show that the maximization goal can be approximated using a regression model.
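A minimal sketch of this approximation idea, assuming configurations can be encoded numerically: a regression model is fitted on already-evaluated (configuration, accuracy) pairs, and its predictions rank unevaluated candidates. The random-forest surrogate and the synthetic data are illustrative choices, not the concrete model used here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical history: numeric encodings of 30 already-evaluated
# configurations and the (synthetic) accuracies they achieved.
X_seen = rng.uniform(0, 1, size=(30, 4))
y_seen = (0.6 + 0.3 * X_seen[:, 0] - 0.1 * X_seen[:, 1]
          + rng.normal(0, 0.02, size=30))

surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
surrogate.fit(X_seen, y_seen)

# Rank a pool of unevaluated candidates by predicted accuracy and
# spend real training time only on the most promising one.
candidates = rng.uniform(0, 1, size=(200, 4))
best = candidates[np.argmax(surrogate.predict(candidates))]
print(best)
```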
2 RELATED WORK
The task of optimizing the hyperparameters of classification algorithms has been addressed by many authors (Bergstra et al., 2011; Forman, 2003; Lim et al., 2000; Dasgupta et al., 2007; Thornton et al., 2012). A large study of classification algorithms shows that not only the accuracy of an algorithm depends on the selected features and input data, but also its training time, scalability, and interpretability (Lim et al., 2000).
other research (Dasgupta et al., 2007) points out the
challenges associated with automated text, such as
a) choosing an appropriate data structure to represent
the documents, b) choosing an appropriate objective
function to optimize in order to avoid overfitting, c)
obtain good generalization, and d) dealing with algo-
rithmic issues arising because of the high formal di-
mensionality of the data. This last challenge can be
addressed via a prior selection of a subset of the fea-
tures available for describing the data (Dasgupta et al.,
2007). Such selection occurs before applying a learn-
ing algorithm and setting its operational parameters.
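For instance, a minimal sketch of such prior selection on text features, using chi-squared scoring as one common stand-in for the selection criterion (the corpus and labels are toy assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["stocks fell sharply", "the team won the match",
        "markets rally on earnings", "coach praises young striker"]
labels = [0, 1, 0, 1]  # toy finance-vs-sports labels

# Vectorize, then keep only the k terms with the highest chi-squared
# association with the class labels, before any classifier is trained.
X = TfidfVectorizer().fit_transform(docs)
X_reduced = SelectKBest(chi2, k=5).fit_transform(X, labels)
print(X.shape, "->", X_reduced.shape)
```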
A large number of feature selection studies have focused on text domains, both for binary and multi-class problems, but they fail to investigate the best possible accuracy obtainable for any single class (Forman, 2003).
Beyond feature selection, the problem of simultaneously selecting a learning algorithm and setting its hyperparameters has also been analyzed in depth. Thornton et al. (2012) provide a tool that effectively identifies machine learning algorithms and hyperparameter settings; however, such approaches still require substantial computational resources to evaluate each candidate model. Moreover, most feature selection studies are conducted in a non-automatic or semi-automatic way, which fails to explore all possible features, attributes, and algorithms.
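A toy sketch of this combined algorithm and hyperparameter selection problem: random configurations are drawn jointly over (algorithm, hyperparameters) and scored by cross-validation. The algorithm portfolio and parameter ranges are illustrative assumptions.

```python
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
random.seed(0)

def sample_config():
    """Draw one (algorithm, hyperparameters) point from the joint space."""
    name = random.choice(["nb", "svm", "tree"])
    if name == "nb":
        return name, GaussianNB()
    if name == "svm":
        return name, LinearSVC(C=10 ** random.uniform(-2, 2), dual=False)
    return name, DecisionTreeClassifier(max_depth=random.randint(2, 10))

# Score 20 random draws by cross-validation and keep the best one.
best = max(((name, cross_val_score(clf, X, y, cv=3).mean())
            for name, clf in (sample_config() for _ in range(20))),
           key=lambda t: t[1])
print(best)
```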
In the next section, we present a methodology for building an effective automatic feature and algorithm selection model, including testing the platform's performance against several types of classification algorithms, training data sets, and document representation methods.
Thus, a solution to the problem has to be designed considering all of the defined features. We propose an optimization model that depends on (i) the quality of the sample sets, (ii) the hyperparameters of the classification algorithm, and (iii) the document representation (text pre-processing). In this context, the complexity of the task stems from the size of the feature space and the computing resources needed to explore the entire domain. We note that past work (Bergstra et al., 2011; Luo, 2016; Friedrichs and Igel, 2005; Snoek et al., 2012) discusses mainly the theoretical aspects of optimization, presenting algorithms but not concrete implementations on a distributed computing architecture.
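Since the configurations are evaluated independently of one another, the exploration parallelizes naturally; the following sketch uses a local process pool as a stand-in for a distributed architecture, with a placeholder evaluation function.

```python
from concurrent.futures import ProcessPoolExecutor

def evaluate(config):
    """Placeholder: train and score one classifier configuration."""
    c, depth = config
    return config, 0.7 + 0.01 * depth - 0.001 * c  # dummy accuracy

if __name__ == "__main__":
    configs = [(c, d) for c in (1, 10, 100) for d in (2, 4, 8)]
    with ProcessPoolExecutor() as pool:  # one evaluation per worker
        results = list(pool.map(evaluate, configs))
    print(max(results, key=lambda r: r[1]))
```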
We also note recent work that gives an extensive analysis of the domain (Schaer et al., 2016). It shows that currently available HPC tools and frameworks do not meet the following requirements: (i) providing a full simulation of the optimization process, (ii) addressing hyperparameter optimization directly, and (iii) providing implementations of classification algorithms. As a result of that work (Schaer et al., 2016), a distributed implementation of hyperparameter optimization for medical image classification was developed. Nevertheless, the same problem in the text classification domain remains open. Thus, in Section 4 we introduce a hybrid MapReduce and MPJ architecture for the fully automatic optimization algorithm, and in Section 3 we provide implementation details through a news classification case study.
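As a toy illustration of the MapReduce decomposition (data flow only, not the actual hybrid implementation described later): the map phase scores each configuration independently, and the reduce phase keeps the best result.

```python
from functools import reduce

configs = [{"alpha": a} for a in (0.1, 0.5, 1.0, 2.0)]

def map_phase(cfg):
    """Map: score one configuration independently of all others."""
    return cfg, 0.8 - abs(cfg["alpha"] - 0.5)  # dummy scoring function

def reduce_phase(best, item):
    """Reduce: keep the higher-scoring (configuration, score) pair."""
    return item if item[1] > best[1] else best

print(reduce(reduce_phase, map(map_phase, configs)))
```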
The arguments presented above let one conclude that the effectiveness of a classification algorithm depends on the following parameters: (i) the ML algorithm itself, (ii) the training data set, and (iii) the document representation (text pre-processing). Most ML algorithms further expose hyperparameters, which change the way the learning algorithm itself works. Hyperparameters