Exploiting Correlation-based Metrics to Assess Encoding Techniques
Giuliano Armano and Emanuele Tamponi
Department of Electrical and Electronic Engineering, University of Cagliari, Cagliari, Italy
Keywords:
Supervised Learning, Correlation, Metrics, Performance, Encoding Techniques, Classification, Prediction.
Abstract:
The performance of a classification system depends on various aspects, including encoding techniques. In
fact, encoding techniques play a primary role in the process of tuning a classifier/predictor, as choosing the
most appropriate encoder may greatly affect its performance. As of now, evaluating the impact of an encoding
technique on a classification system typically requires training the system and testing it by means of a performance
metric deemed relevant (e.g., precision, recall, or the Matthews correlation coefficient). For this reason, assess-
ing a single encoding technique is a time-consuming activity, which introduces additional degrees of
freedom (e.g., parameters of the training algorithm) that may be uncorrelated with the encoding technique to
be assessed. In this paper, we propose a family of methods to measure the performance of encoding techniques
used in classification tasks, based on the correlation between encoded input data and the corresponding output.
The proposed approach provides correlation-based metrics, devised with the primary goal of focusing on the
encoding technique while leaving unrelated aspects aside. Notably, the proposed technique saves a great deal of
computational time, as it needs only a tiny fraction of the time required by standard methods.
1 INTRODUCTION
When facing a difficult classification or prediction
task (e.g., protein secondary structure prediction, face
recognition, fingerprint recognition), the correspond-
ing system must be tuned with great care. Without
loss of generality, let us consider any such system as a
pipeline consisting of two cascading parts: an encod-
ing module and a classifier/predictor. The encoding
module is fed with raw input data and provides the
classifier/predictor with properly encoded data, so as
to facilitate the learning task.
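The two-stage pipeline can be sketched as follows. This is an illustrative example only, not code from the paper: the one-hot encoder and the toy classifier below are hypothetical placeholders standing in for a real encoding module and a real classifier/predictor.

```python
# Illustrative sketch of a classification system as a pipeline of two
# cascading parts: an encoding module followed by a classifier/predictor.
# Both components are hypothetical toy examples.

def one_hot_encoder(symbol, alphabet="ACGT"):
    """Encoding module: map a raw symbol to a one-hot vector."""
    return [1.0 if c == symbol else 0.0 for c in alphabet]

def toy_classifier(encoded):
    """Toy classifier: predicts class 1 if the first component is set."""
    return 1 if encoded[0] > 0.5 else 0

def pipeline(raw_input):
    """The encoder feeds the classifier/predictor with encoded data."""
    return toy_classifier(one_hot_encoder(raw_input))

print(pipeline("A"))  # -> 1
print(pipeline("C"))  # -> 0
```

Swapping `one_hot_encoder` for a different encoding technique changes the representation seen by the classifier, which is exactly the design choice this paper aims to evaluate.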
Choosing a good encoding technique is crucial to
improve the overall performance of a system. How-
ever, to the best of our knowledge, no specific methods
have been proposed to assess an encoding technique in
isolation from the corresponding classifier/predictor. In
fact, the system is typically considered as a whole,
and the overall performance is used as an indirect met-
ric to assess alternative encodings. This standard ap-
proach has some advantages; in particular, it provides
performance estimates of the final system. For exam-
ple, precision and recall have a clear meaning, as do
ROC curves and the Matthews correlation coefficient.
It can also be used to assess encoding techniques, accord-
ing to the following strategy: several systems, which
differ only in the encoding technique, are tested
separately, giving rise to a comparative table that typ-
ically reports all performance metrics deemed rele-
vant. In the presence of enough test data, one may assume
that statistical significance holds. Hence, it becomes
viable to assume that any observed changes in the
performance indices depend on the encoder. According
to the selected performance metric, one may also
generate a ranking of encoders.
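The standard strategy above can be sketched in a few lines. This is a hedged illustration, not the paper's experimental setup: the dataset, the two encoders, and the nearest-centroid learner are all hypothetical toys, and accuracy stands in for whatever performance metric is deemed relevant.

```python
# Illustrative sketch of the standard strategy: train otherwise-identical
# systems that differ only in the encoding technique, then tabulate a
# performance metric for each, yielding a ranking of encoders.
# Dataset, encoders, and classifier are hypothetical toy examples.

def encoder_identity(x):
    return [float(x)]

def encoder_abs(x):
    # Deliberately lossy encoder: discards the sign of the input.
    return [float(abs(x))]

def fit_centroids(encoded, labels):
    """Toy learner: one centroid (mean vector) per class."""
    centroids = {}
    for y in set(labels):
        rows = [v for v, lab in zip(encoded, labels) if lab == y]
        centroids[y] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, v):
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda y: dist(centroids[y], v))

def accuracy(encoder, train, test):
    """Train and test a system built on top of the given encoder."""
    model = fit_centroids([encoder(x) for x, _ in train],
                          [y for _, y in train])
    hits = sum(predict(model, encoder(x)) == y for x, y in test)
    return hits / len(test)

train = [(-2, 0), (-1, 0), (1, 1), (3, 1)]
test = [(-3, 0), (3, 1)]
for name, enc in [("identity", encoder_identity), ("abs", encoder_abs)]:
    print(name, accuracy(enc, train, test))  # comparative table row
```

Note that ranking encoders this way requires one full training run per encoder, which is precisely the cost that motivates the correlation-based alternative proposed in this paper.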
Unfortunately, the above strategy has some impor-
tant drawbacks, the main one being that every per-
formance evaluation is highly time-consuming, often
making it unfeasible to test many different encoding
techniques. For example, a 10-fold cross validation of
a system based on neural networks, devised for protein
secondary structure prediction, usually takes several
hours to complete. Now, assuming that the technique
at hand is parametric, finding the optimal value of the
parameter may require weeks or months to complete
(as an experiment should be run for every value of
the parameter). Another drawback is that the encod-
ing technique is not assessed in isolation, being part
of a pipeline. This introduces some degrees of free-
dom that are uncorrelated with the encoder, e.g., the
parameters of the learning algorithm, thus reducing
the confidence about the statistical significance of exper-
imental results. A trivial solution to this problem is
to increase the number of trials; however, this ends up
Armano G. and Tamponi E. (2013).
Exploiting Correlation-based Metrics to Assess Encoding Techniques.
In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods, pages 308-314.
DOI: 10.5220/0004267503080314
Copyright © SciTePress